Preface 



The 12th Australian Joint Conference on Artificial Intelligence (AI’99), held in 
Sydney, Australia, 6-10 December 1999, is the latest in a series of annual 
regional meetings at which advances in artificial intelligence are reported. This 
series now attracts many international papers, and indeed the constitution of 
the program committee reflects this geographical diversity. Besides the usual 
tutorials and workshops, this year the conference included a companion sympo- 
sium at which papers on industrial applications were presented. The symposium 
papers have been published in a separate volume edited by Eric Tsui. AI’99 is 
organized by the University of New South Wales, and sponsored by the Aus- 
tralian Computer Society, the Commonwealth Scientific and Industrial Research 
Organisation (CSIRO), Computer Sciences Corporation, the KRRU group at 
Griffith University, the Australian Artificial Intelligence Institute, and Neuron- 
Works Ltd. 

AI’99 received over 120 conference paper submissions, of which about one-third 
were from outside Australia. From these, 39 were accepted for regular 
presentation, and a further 15 for poster display. These proceedings contain the 
full regular papers and extended summaries of the poster papers. All papers were 
refereed, mostly by two or three reviewers selected by members of the program 
committee, and a list of these reviewers appears later. 

The technical program comprised two days of workshops and tutorials, fol- 
lowed by three days of conference and symposium plenary and paper sessions. 
Distinguished plenary speakers were invited to share their experience and exper- 
tise in AI with our delegates. Some workshops (Knowledge Acquisition, Com- 
monsense Reasoning, AI Applications to Plant and Animal Production) also had 
their own invited speakers. 

The plenary speakers were: Paul Beinat (NeuronWorks), Usama Fayyad (Microsoft 
Research), Michael Georgeff (Australian Artificial Intelligence Institute), 
Hiroshi Motoda (Osaka University), and Eugene C. Freuder (University of New 
Hampshire). The workshop speakers were: Takahira Yamaguchi (Shizuoka 
University) in Knowledge Acquisition, John McCarthy (Stanford University) and 
Michael Thielscher (Dresden University) in Commonsense Reasoning, and M.H. 
Rasmy (University of Cairo) in AI Applications to Plant and Animal Production. 
We thank our sponsors and the workshop organizers for supporting their 
visits. 

The smooth running of AI’99 was largely due to its program committee, the 
administrative staff, the reviewers, the conference organizers CIM Pty Ltd, and 
the managers of the venue, Coogee Holiday Inn. 

We thank Springer-Verlag and its representative Alfred Hofmann for efficient 
assistance in producing these proceedings of AI’99 as a volume in the Lecture 
Notes in Artificial Intelligence series. 



October 1999 



Norman Foo 
the Conference Chair 
Generating Rule Sets from Model Trees 



Geoffrey Holmes, Mark Hall, and Eibe Frank 



Department of Computer Science 
University of Waikato, New Zealand 
Phone +64 7 838-4405 
{geoff, mhall, eibe}@cs.waikato.ac.nz 



Abstract. Model trees (decision trees with linear models at the leaf 
nodes) have recently emerged as an accurate method for numeric prediction 
that produces understandable models. However, it is known that decision 
lists (ordered sets of If-Then rules) have the potential to be more compact 
and therefore more understandable than their tree counterparts. 

We present an algorithm for inducing simple, accurate decision lists from 
model trees. Model trees are built repeatedly and the best rule is selected 
at each iteration. This method produces rule sets that are as accurate as, but 
smaller than, the model tree constructed from the entire dataset. Experimental 
results for various heuristics which attempt to find a compromise between 
rule accuracy and rule coverage are reported. We show that our method 
produces comparably accurate and smaller rule sets than the commercial 
state-of-the-art rule learning system Cubist. 



1 Introduction 

Recent work in knowledge discovery on time series data [3] indicates that the 
scope of application of machine learning algorithms has gone beyond the 
relatively “straightforward” classification of nominal attributes in data. These 
applications are important to business, medicine, engineering and the social 
sciences, particularly in areas concerned with understanding data from sensors [8]. 

Of equal importance, particularly for business applications, is the prediction, 
and consequent interpretation, of numeric values. For example, the 1998 
KDD-Cup concentrated on predicting whether or not someone would donate to a 
charity. It is arguable that the charity would like to know both the amount 
someone is likely to donate and the factors which determine this donation from 
historical data, so that it can produce a more effective marketing campaign. 

Predicting numeric values usually involves complicated regression formulae. 
However, in machine learning it is important to present results that can be eas- 
ily interpreted. Decision lists presented in the If-Then rule format are one of 
the most popular description languages used in machine learning. They have 
the potential to be more compact and more predictive than their tree counter- 
parts [16]. In any application, the desired outcome is a small descriptive model 
which has strong predictive capability. It has to be small to be interpretable and 
understandable, and it has to be accurate so that generalization capabilities can 
be attributed to the model. 

In this paper we present a procedure for generating rules from model 
trees [10], based on the basic strategy of the PART algorithm [4], that produces 
accurate and compact rule sets. Section 2 discusses the motivation for PART 
and alternative approaches to continuous class prediction. Section 3 describes 
the adaptation of PART to model trees. Section 4 presents an experimental 
evaluation on standard datasets. We compare the accuracy and size of the rule 
sets of our procedure with model trees and the rule-based regression learner 
Cubist¹, the commercial successor of M5 [10]. Section 5 concludes with a discussion 
of the results and areas for further research on this problem. 

2 Related Work 

Rule learning for classification systems normally operates in two stages. Rules 
are induced initially and then refined at a later stage using a complex global 
optimization procedure. This is usually accomplished in one of two ways: either 
by generating a decision tree, mapping the tree to a rule set and then refining 
the rule set based on boundary considerations of the coverage achieved by each 
rule, or by employing the separate-and-conquer paradigm. As with decision trees, 
this strategy usually employs a rule optimization stage. 

Frank and Witten (1998) combined these two approaches in an algorithm 
called PART (for partial decision trees) in order to circumvent problems that 
can arise with both these techniques. Rules induced from decision trees are com- 
putationally expensive and this expense can grow alarmingly in the presence of 
noise [2], while separate-and-conquer methods suffer from a form of overpruning 
called “hasty generalization” [4]. 

PART works by building a rule and removing its cover, as in the separate- 
and-conquer technique, repeatedly until all the instances are covered. The rule 
construction stage differs from standard separate-and-conquer methods because 
a partial pruned decision tree is built for a set of instances, the leaf with the 
largest coverage is made into a rule, and the tree is discarded. The pruned deci- 
sion tree helps to avoid the overpruning problem of methods that immediately 
prune an individual rule after construction. Also, the expensive rule optimization 
stages associated with decision tree rule learning are not performed. Results 
on standard data sets show smaller rule sizes with no loss in accuracy when 
compared with the decision tree learner C4.5 [11] and greater accuracy when 
compared with the separate-and-conquer rule learner RIPPER [2]. In this paper 
we adapt the basic procedure of PART to continuous class prediction to examine 
whether similar results can be obtained, namely smaller rule sets with no loss in 
accuracy. 

Although the literature is light in the area of rule-based continuous class 
prediction, a taxonomy can be found. A first split can be made on whether a 
technique generates interpretable results. Those that do not include neural 
networks and various statistical approaches to non-linear regression, 
such as MARS [5] and projection pursuit [6]. Those that do produce readable 
output are further split according to which of the two major paradigms for rule 
generation they follow: rule sets represented as regression or model trees, 
or the separate-and-conquer rule-learning approach. Examples from the regression 
tree family include CART [1], RETIS [7] and M5 [10]. Separate-and-conquer 
methods include a system that maps a regression problem into a classification 
problem [16], and a propositional learning system [14]. 

¹ A test version of Cubist is available from http://www.rulequest.com 

3 Generating Rules From Model Trees 

Model trees [10] are a technique for dealing with continuous class problems that 
provide a structural representation of the data and a piecewise linear fit of the 
class. They have a conventional decision tree structure but use linear functions 
at the leaves instead of discrete class labels. The first implementation of model 
trees, M5, was rather abstractly defined in [10] and the idea was reconstructed 
and improved in a system called M5' [15]. Like conventional decision tree 
learners, M5' builds a tree by splitting the data based on the values of predictive 
attributes. Instead of selecting attributes by an information theoretic metric, 
M5' chooses attributes that minimise intra-subset variation in the class values 
of instances that go down each branch. 
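
The attribute-selection step can be illustrated with a minimal sketch. It assumes, following published descriptions of M5, that intra-subset variation is measured by the standard deviation of the class values and that a candidate split is scored by the resulting standard deviation reduction. The function names `sdr` and `best_split` are hypothetical, and only a binary split on a single numeric attribute is handled.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    n = len(parent)
    # Weighted average of the subset standard deviations (empty subsets ignored).
    weighted = sum(len(s) / n * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted

def best_split(values, targets):
    """Return (threshold, reduction) of the best binary split on one numeric attribute."""
    pairs = sorted(zip(values, targets))
    ys = [y for _, y in pairs]
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        gain = sdr(ys, [ys[:i], ys[i:]])
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

A learner built this way would evaluate `best_split` for every attribute and split on the one with the largest reduction.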

After constructing a tree, M5' computes a linear model for each node; the 
tree is then pruned back from the leaves, so long as the expected estimated 
error decreases. The expected error for each node is calculated by averaging the 
absolute difference between the predicted value and the actual class value of 
each training example that reaches the node. To compensate for an optimistic 
expected error from the training data, this average is multiplied by a factor that 
takes into account the number of training examples that reach the node and the 
number of parameters in the model that represent the class value at that node. 
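
The error estimate used for pruning can be sketched as follows. The text above does not state the compensation factor; the sketch assumes the factor (n + v) / (n - v) given in published descriptions of M5, where n is the number of training examples reaching the node and v the number of parameters in its model. The function name and the handling of the case n <= v are illustrative assumptions.

```python
def adjusted_error(actual, predicted, num_params):
    """Mean absolute error at a node, inflated to offset training-set optimism."""
    n = len(actual)
    if n <= num_params:
        # Assumption: a model with at least as many parameters as examples
        # is treated as having unbounded expected error.
        return float("inf")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    # Assumed compensation factor (n + v) / (n - v).
    return mae * (n + num_params) / (n - num_params)
```

Under this estimate a subtree would be replaced by a leaf when the leaf's adjusted error is no larger than that of the subtree.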

This process of tree construction can lead to sharp discontinuities occurring 
between adjacent linear models at the leaves of the pruned tree. A procedure 
called smoothing is used to compensate for these differences. The smoothing 
procedure computes a prediction using the leaf model, and then passes that 
value along the path back to the root, smoothing it at each node by combining 
it with the value predicted by the linear model for that node. 
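
A minimal sketch of the smoothing procedure is given below. The combination rule is not stated above; the sketch assumes the rule reported for M5', p' = (n p + k q) / (n + k), where p is the prediction passed up from below, q is the prediction of the model at the current node, n is the number of training instances reaching the node below, and k is a smoothing constant (15 is assumed here). The function name and the path representation are hypothetical.

```python
def smooth(leaf_prediction, path, k=15.0):
    """Smooth a leaf prediction along the path from the leaf back to the root.

    `path` lists (n_below, node_prediction) pairs, ordered from the leaf's
    parent up to the root.
    """
    p = leaf_prediction
    for n_below, q in path:
        # Assumed combination rule: weight the value from below by its support.
        p = (n_below * p + k * q) / (n_below + k)
    return p
```

For example, a leaf value of 10.0 supported by 15 instances and a parent prediction of 20.0 would be smoothed to 15.0 under these assumptions.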



3.1 Rule Generation 

The method for generating rules from model trees, which we call M5'Rules, is 
straightforward and works as follows: a tree learner (in this case model trees) is 
applied to the full training dataset and a pruned tree is learned. Next, the best 
leaf (according to some heuristic) is made into a rule and the tree is discarded. 
All instances covered by the rule are removed from the dataset. The process is 
applied recursively to the remaining instances and terminates when all instances 
are covered by one or more rules. This is the basic separate-and-conquer strategy 
for learning rules; however, instead of building a single rule, as is usually done, 
we build a full model tree at each stage, and make its “best” leaf into a rule. 
This avoids the potential for the form of over-pruning called hasty 
generalization [4]. In contrast to PART, which employs the same strategy for 
categorical prediction, M5'Rules builds full trees instead of partially explored 
trees. Building partial trees leads to greater computational efficiency, and does 
not affect the size and accuracy of the resulting rules. 
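
The loop described above can be sketched as follows. This is an illustration of the separate-and-conquer control flow only: `toy_tree_leaves` is a hypothetical stand-in for the pruned model tree (a single split on one numeric attribute with a constant prediction per leaf), not M5', and all names are assumptions introduced for the example.

```python
from collections import namedtuple
from statistics import mean

# A rule covers x when lo < x <= hi and predicts a constant value.
Rule = namedtuple("Rule", "lo hi value")

def covers(rule, x):
    return rule.lo < x <= rule.hi

def toy_tree_leaves(data):
    """Stand-in tree learner: one split between the middle distinct x values."""
    distinct = sorted({x for x, _ in data})
    if len(distinct) < 2:
        return [Rule(float("-inf"), float("inf"), mean(y for _, y in data))]
    mid = len(distinct) // 2
    t = (distinct[mid - 1] + distinct[mid]) / 2
    left = [y for x, y in data if x <= t]
    right = [y for x, y in data if x > t]
    return [Rule(float("-inf"), t, mean(left)), Rule(t, float("inf"), mean(right))]

def coverage(rule, data):
    """Coverage heuristic: number of remaining instances the leaf covers."""
    return sum(1 for x, _ in data if covers(rule, x))

def learn_rules(data, build_leaves=toy_tree_leaves, score=coverage):
    rules, remaining = [], list(data)
    while remaining:
        leaves = build_leaves(remaining)        # build a tree on what is left
        best = max(leaves, key=lambda leaf: score(leaf, remaining))
        rules.append(best)                      # keep the best leaf as a rule
        # Discard the tree and remove the instances the new rule covers.
        remaining = [(x, y) for x, y in remaining if not covers(best, x)]
    return rules

def predict(rules, x):
    """Apply the rules as a decision list: the first covering rule fires."""
    for rule in rules:
        if covers(rule, x):
            return rule.value
    raise ValueError("no rule covers x")
```

Because the rules form an ordered list, each rule is interpreted only for instances not covered by the rules before it; a different selection heuristic can be supplied through the `score` argument.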

This paper concentrates on generating rules using unsmoothed linear models. 
Because the tree from which a rule is generated is discarded at each stage, 
smoothing for rules would have to be done as a post processing stage after 
the full set of rules has been produced. This process is more complicated than 
smoothing model trees — it would involve determining the boundaries between 
rules and then installing linear models to smooth over them. 




Linear models: 

LM1: T20Bolt = 0 + 2Time 
LM2: T20Bolt = 5.26 + 0.465Time 
LM3: T20Bolt = 74.9 



M5'Rules 

Rules found by max coverage. 

Rule 1: 

Time <= 32.2 
Total > 15 

->T20Bolt = 5.26 + 0.465Time [16/5.44%] 

Rule 2: 

Time <= 23.2 

->T20Bolt = 0 + 2Time [12/0%] 

Rule 3: 

-> T20Bolt = 76 - 0.214Run + 0.914Time [12/38.7%] 



Fig. 1. Model tree and rules for the bolts dataset. 



3.2 Rule Selection Heuristics 

So far we have described a general approach to extracting rules from trees, 
applicable to either classification or regression. It remains to determine, at each 
stage, which leaf in the tree is the best candidate for addition to the rule set. The 
most obvious approach [4] is to choose the leaf which covers the most examples. 
Figure 1 shows a tree produced by M5' and the rules generated by M5'Rules 
using the coverage heuristic for the dataset bolts [13]. The values at the leaves 
of the tree and on the consequent of the rules are the coverage and percent root 
mean squared error respectively for instances that reach those leaves (satisfy 
those rules). 
Note that the first rule will always map directly to one branch of the tree; 
however, subsequent rules often do not. In Figure 1, Rule 1 and LM2 are identical, 
as are Rule 2 and LM1; however, Rule 3 and LM3 are very different. 

We have experimented with three other heuristics, designed to identify accu- 
rate rules and to trade off accuracy against coverage. These measures are similar 
to those used in the separate-and-conquer procedure when evaluating the spe- 
cialization of one rule from another [14]. 

The first of these calculates the percent root mean squared error as shown 
in Equation 1: 

\[
\%\,\mathrm{RMS} = \frac{\sqrt{\sum_{i=1}^{N_r} (Y_i - y_i)^2 / N_r}}{\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2 / N}} \tag{1}
\]

where $Y_i$ is the actual class value for example $i$, $y_i$ is the class value predicted 
by the linear model at a leaf, $N_r$ is the number of examples covered by the leaf, 
$\bar{Y}$ is the mean of the class values, and $N$ is the total number of examples. In this 
case, small values of % RMS (less than 1) indicate that the model at a leaf is 
doing better than simply predicting the mean of the class values. 

One potential problem with percent root mean squared error is that it may 
favour accuracy at the expense of coverage. Equations 2 and 3 show two heuristic 
measures designed to trade off accuracy against coverage. The first simply 
normalises the mean absolute error at a leaf using the number of examples it 
covers; the second multiplies the correlation between the predicted and actual 
class values for instances at a leaf by the number of instances that reach the leaf. 

\[
\mathrm{MAE}/\mathrm{Cover} = \frac{1}{N_r} \cdot \frac{\sum_{i=1}^{N_r} |Y_i - y_i|}{N_r} \tag{2}
\]

\[
\mathrm{CC} \times \mathrm{Cover} = \frac{\sum_{i=1}^{N_r} Y_i\, y_i}{N_r\, \sigma_Y\, \sigma_y} \times N_r \tag{3}
\]

In Equation 3, $Y_i$ and $y_i$ are the actual value and predicted value for instance $i$ 
expressed as deviations from their respective means, and $\sigma_Y$ and $\sigma_y$ are the 
corresponding standard deviations. 
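
The three measures can be computed as in the following sketch, which follows the definitions given in the text: percent RMS as the leaf's root mean squared error relative to that of predicting the global mean, MAE / Cover as the leaf's mean absolute error divided by its coverage, and CC x Cover as the correlation coefficient at the leaf multiplied by its coverage. The function names are hypothetical, and returning 0 when the correlation is undefined is an assumption of the sketch.

```python
from math import sqrt

def percent_rms(leaf_actual, leaf_pred, all_actual):
    """RMS error at the leaf relative to the RMS error of predicting the global mean."""
    n_r, n = len(leaf_actual), len(all_actual)
    mean_all = sum(all_actual) / n
    leaf = sqrt(sum((a - p) ** 2 for a, p in zip(leaf_actual, leaf_pred)) / n_r)
    base = sqrt(sum((a - mean_all) ** 2 for a in all_actual) / n)
    return leaf / base

def mae_over_cover(leaf_actual, leaf_pred):
    """Mean absolute error at the leaf divided by the number of covered examples."""
    n_r = len(leaf_actual)
    mae = sum(abs(a - p) for a, p in zip(leaf_actual, leaf_pred)) / n_r
    return mae / n_r

def cc_times_cover(leaf_actual, leaf_pred):
    """Correlation between actual and predicted values times the leaf's coverage."""
    n_r = len(leaf_actual)
    mean_a = sum(leaf_actual) / n_r
    mean_p = sum(leaf_pred) / n_r
    dev_a = [a - mean_a for a in leaf_actual]
    dev_p = [p - mean_p for p in leaf_pred]
    sd_a = sqrt(sum(d * d for d in dev_a) / n_r)
    sd_p = sqrt(sum(d * d for d in dev_p) / n_r)
    if sd_a == 0 or sd_p == 0:
        return 0.0  # assumption: undefined correlation is scored as zero
    cc = sum(x * y for x, y in zip(dev_a, dev_p)) / (n_r * sd_a * sd_p)
    return cc * n_r
```

The first two functions are error measures, whereas the third rewards both high correlation and high coverage, so a rule-selection routine would minimise the former and maximise the latter.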



4 Experimental Results 

In order to evaluate the performance of M5'Rules on a diverse set of machine 
learning problems, experiments were performed using thirty continuous class 
datasets. The datasets and their properties are listed in Table 1, and can be 
obtained from the authors upon request. Nineteen of these datasets were used 
by Kilpatrick and Cameron-Jones [9], six are from the StatLib repository [13], 
and the remaining five were collected by Simonoff [12]. 

Table 1. Continuous class datasets used in the experiments 

Dataset       Instances   Missing      Numeric      Nominal
                          values (%)   attributes   attributes
auto93               93       0.7          16            6
autoHorse           205       1.1          17            8
autoMpg             398       0.2           4            3
autoPrice           159       0.0          15            0
baskball             96       0.0           4            0
bodyfat             252       0.0          14            0
breastTumor         286       0.3           1            8
cholesterol         303       0.1           6            7
cleveland           303       0.1           6            7
cloud               108       0.0           4            2
cpu                 209       0.0           6            1
echoMonths          131       7.5           6            3
elusage              55       0.0           1            1
fishcatch           158       6.9           5            2
housing             506       0.0          12            1
hungarian           294      19.0           6            7
lowbwt              189       0.0           2            7
mbagrade             61       0.0           1            1
meta                528       4.3          19            2
pbc                 418      15.6          10            8
pharynx             195       0.1           1           10
pollution            60       0.0          15            0
pwLinear            200       0.0          10            0
quake              2178       0.0           3            0
sensory             576       0.0           0           11
servo               167       0.0           0            4
sleep                62       2.4           7            0
strike              625       0.0           5            1
veteran             137       0.0           3            4
vineyard             52       0.0           3            0

As well as M5'Rules using each of the rule-selection heuristics described 
above, M5' (with unsmoothed linear models) and the commercial regression 
rule learning system Cubist were run on all the datasets. Default parameter 
settings were used for all algorithms. The mean absolute error, averaged over ten 
ten-fold cross-validation runs, and the standard deviation of these ten error 
estimates were calculated for each algorithm-dataset combination. The same folds 
were used for each algorithm. 

Table 2 compares the results for M5'Rules with those for M5' unsmoothed. 
Results for M5'Rules are marked with a o if they show a significant improvement 
over the corresponding results for M5', and with a • if they show a significant 
degradation. Results marked with a √ show where M5'Rules has produced 
significantly fewer rules than M5'; those marked with a × show where M5'Rules has 
produced significantly more rules than M5'. Results are considered “significant” 
if the difference is statistically significant at the 1% level according to a paired 
two-sided t-test, each pair of data points consisting of the estimates obtained in 
one ten-fold cross-validation run for the two learning algorithms being compared. 
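
The significance test can be sketched with the standard library alone. The sketch assumes ten paired estimates per comparison (one per cross-validation run), hence nine degrees of freedom, for which the two-sided 1% critical value of Student's t is approximately 3.250. The function names are hypothetical.

```python
from math import copysign, sqrt
from statistics import mean, stdev

# Two-sided 1% critical value of Student's t with 9 degrees of freedom.
T_CRIT_1PCT_DF9 = 3.250

def paired_t(a, b):
    """Paired t statistic for two equally long sequences of error estimates."""
    diffs = [x - y for x, y in zip(a, b)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # All differences identical: no evidence if they are zero,
        # otherwise the statistic is unbounded.
        return 0.0 if m == 0 else copysign(float("inf"), m)
    return m / (sd / sqrt(len(diffs)))

def significantly_different(a, b, t_crit=T_CRIT_1PCT_DF9):
    """True when the paired difference is significant at the assumed level."""
    return abs(paired_t(a, b)) > t_crit
```

The sign of the statistic indicates which of the two algorithms has the lower error, which is what distinguishes an improvement from a degradation.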
Table 2. Experimental results: comparing M5'Rules with M5'. The values are 
mean absolute error averaged over ten ten-fold cross-validation runs. Results are 
marked with a o if they show a significant improvement over M5' unsmoothed, 
and with a • if they show a significant degradation. Results marked with a √ 
show where M5'Rules has produced significantly fewer rules than M5'; those 
marked with a × show where M5'Rules has produced significantly more rules 
than M5'. 

Dataset      M5'            M5'R              M5'R              M5'R              M5'R
             Unsmoothed     % RMS             MAE/Cover         CC×Cover          Cover
auto93       3.66±0.2       3.66±0.2          3.66±0.2          3.66±0.2          3.66±0.2
autoHorse    8.97±0.5       9.44±0.5 √        9.36±0.5          9.40±0.5 •√       9.32±0.5 √
autoMpg      2.08±0.0       2.10±0.1 √        2.08±0.1 √        2.08±0.0 √        2.08±0.0
autoPrice    1522.96±53.2   1636.90±96.6 •√   1655.50±109.9 •√  1650.81±129.0 √   1637.44±124.7 •√
baskball     0.07±0.0       0.07±0.0          0.07±0.0          0.07±0.0          0.07±0.0
bodyfat      0.37±0.1       0.40±0.0 ×        0.38±0.1 ×        0.37±0.1          0.36±0.1
breastTumor  8.06±0.1       8.06±0.1          8.06±0.1          8.06±0.1          8.06±0.1
cholesterol  40.98±1.4      40.91±1.4         40.99±1.4         40.77±1.4         40.98±1.4
cleveland    0.66±0.0       0.65±0.0          0.66±0.0          0.66±0.0          0.66±0.0
cloud        0.29±0.0       0.28±0.0          0.28±0.0          0.29±0.0          0.29±0.0
cpu          13.40±1.2      13.31±1.3         13.33±1.3         13.18±1.5         13.27±1.4
echoMonths   8.90±0.1       8.90±0.1          8.90±0.1          8.90±0.1          8.90±0.1
elusage      9.57±0.6       9.57±0.6          9.57±0.6          9.57±0.6          9.57±0.6
fishcatch    38.70±1.6      39.47±1.5 √       41.55±1.5 •√      38.53±1.9 √       38.61±1.8 √
housing      2.75±0.2       2.64±0.1 √        2.71±0.2 √        2.71±0.1 √        2.77±0.1 √
hungarian    0.28±0.0       0.28±0.0 √        0.28±0.0 √        0.28±0.0          0.28±0.0
lowbwt       370.93±6.7     370.93±6.7        370.93±6.7        370.93±6.7        370.57±6.4
mbagrade     0.23±0.0       0.23±0.0          0.23±0.0          0.23±0.0          0.23±0.0
meta         115.73±13.3    123.82±24.5 √     135.33±22.8 √     131.29±12.8 √     127.72±25.1 √
pbc          716.13±12.8    715.67±12.2       716.13±12.8       716.13±12.8       716.13±12.8
pharynx      352.85±5.8     352.66±6.1        351.82±7.5        352.76±7.9        353.24±5.9
pollution    35.15±2.0      35.15±2.0         35.03±2.1         34.99±2.1         35.03±2.1
pwLinear     1.15±0.0       1.15±0.0          1.15±0.0          1.15±0.0          1.15±0.0
quake        0.15±0.0       0.15±0.0 √        0.15±0.0          0.15±0.0 •√       0.15±0.0
sensory      0.58±0.0       0.58±0.0 √        0.58±0.0          0.59±0.0 √        0.58±0.0 √
servo        0.31±0.0       0.32±0.0 √        0.32±0.0 √        0.32±0.0 √        0.32±0.0 √
sleep        2.56±0.1       2.56±0.1          2.56±0.1          2.56±0.1          2.56±0.1
strike       215.87±7.1     231.12±9.7 •√     220.14±4.9 √      222.95±6.4 √      214.91±7.4 √
veteran      92.06±4.3      90.48±4.8         90.49±4.8         90.91±4.5         91.52±4.7
vineyard     2.48±0.1       2.51±0.2 √        2.51±0.1 √        2.43±0.1 √        2.51±0.1 √

o, • statistically significant improvement or degradation 



From Table 2 it can be seen that all four heuristic methods for choosing rules 
give results that are rarely significantly worse than M5'. In fact, choosing rules 
simply by coverage gives an excellent result: accuracy on only one dataset is 
significantly degraded. Each of the remaining three heuristics degrades accuracy 
on two datasets. 

As well as accuracy, the size of the rule set is important because it has 
a strong influence on comprehensibility. Correlation times coverage and plain 
coverage never result in a larger rule set than M5'. These two heuristics reduce 
the size of the rule set on eleven and ten datasets respectively. Both percent 
root mean squared error and mean absolute error over cover increase the size of 
the rule set on one dataset, while decreasing size on twelve and eleven datasets 
respectively. 
Table 3. Experimental results: comparing accuracy of M5'Rules with Cubist. 
The values are mean absolute error averaged over ten ten-fold cross-validation 
runs. Results are marked with a o if they show a significant improvement over 
Cubist, and with a • if they show a significant degradation (the precision of the 
results shown in the table is such that some appear identical but in fact are 
significantly different, e.g. baskball). 

Dataset      Cubist         M5'R              M5'R              M5'R              M5'R
                            % RMS             MAE/Cover         CC×Cover          Cover
auto93       4.07±0.2       3.66±0.2 o        3.66±0.2 o        3.66±0.2 o        3.66±0.2 o
autoHorse    9.27±0.5       9.44±0.5          9.36±0.5          9.40±0.5          9.32±0.5
autoMpg      2.24±0.1       2.10±0.1 o        2.08±0.1 o        2.08±0.0 o        2.08±0.0 o
autoPrice    1639.12±63.8   1636.90±96.6      1655.50±109.9     1650.81±129.0     1637.44±124.7
baskball     0.07±0.0       0.07±0.0 o        0.07±0.0 o        0.07±0.0 o        0.07±0.0 o
bodyfat      0.33±0.0       0.40±0.0 •        0.38±0.1          0.37±0.1          0.36±0.1
breastTumor  8.97±0.1       8.06±0.1 o        8.06±0.1 o        8.06±0.1 o        8.06±0.1 o
cholesterol  43.02±1.5      40.91±1.4 o       40.99±1.4         40.77±1.4 o       40.98±1.4
cleveland    0.65±0.0       0.65±0.0          0.66±0.0          0.66±0.0          0.66±0.0
cloud        0.26±0.0       0.28±0.0 •        0.28±0.0 •        0.29±0.0 •        0.29±0.0 •
cpu          10.96±1.1      13.31±1.3 •       13.33±1.3 •       13.18±1.5 •       13.27±1.4 •
echoMonths   9.41±0.2       8.90±0.1 o        8.90±0.1 o        8.90±0.1 o        8.90±0.1 o
elusage      7.59±0.2       9.57±0.6 •        9.57±0.6 •        9.57±0.6 •        9.57±0.6 •
fishcatch    41.66±0.8      39.47±1.5 o       41.55±1.5         38.53±1.9 o       38.61±1.8 o
housing      2.37±0.1       2.64±0.1 •        2.71±0.2 •        2.71±0.1 •        2.77±0.1 •
hungarian    0.23±0.0       0.28±0.0 •        0.28±0.0 •        0.28±0.0 •        0.28±0.0 •
lowbwt       340.29±7.2     370.93±6.7 •      370.93±6.7 •      370.93±6.7 •      370.57±6.4 •
mbagrade     0.23±0.0       0.23±0.0          0.23±0.0          0.23±0.0          0.23±0.0
meta         107.26±9.8     123.82±24.5       135.33±22.8 •     131.29±12.8 •     127.72±25.1 •
pbc          774.76±16.3    715.67±12.2 o     716.13±12.8 o     716.13±12.8 o     716.13±12.8 o
pharynx      448.93±2.7     352.66±6.1 o      351.82±7.5 o      352.76±7.9 o      353.24±5.9 o
pollution    34.68±2.4      35.15±2.0         35.03±2.1         34.99±2.1         35.03±2.1
pwLinear     1.14±0.0       1.15±0.0          1.15±0.0          1.15±0.0          1.15±0.0
quake        0.15±0.0       0.15±0.0          0.15±0.0          0.15±0.0          0.15±0.0
sensory      0.61±0.0       0.58±0.0 o        0.58±0.0 o        0.59±0.0 o        0.58±0.0 o
servo        0.38±0.0       0.32±0.0 o        0.32±0.0 o        0.32±0.0 o        0.32±0.0 o
sleep        2.84±0.2       2.56±0.1 o        2.56±0.1 o        2.56±0.1 o        2.56±0.1 o
strike       201.31±5.0     231.12±9.7 •      220.14±4.9 •      222.95±6.4 •      214.91±7.4 •
veteran      88.76±5.5      90.48±4.8         90.49±4.8         90.91±4.5         91.52±4.7
vineyard     2.28±0.1       2.51±0.2 •        2.51±0.1 •        2.43±0.1 •        2.51±0.1 •

o, • statistically significant improvement or degradation 



Table 3 compares accuracy for M5'Rules with that for Cubist. Table 4 and 
Table 5 compare the average number of rules produced and average number of 
conditions per rule set respectively. The results for both accuracy and number 
of rules (as well as Table 2) are summarised for quick comparison in Table 6. 
Each entry in Table 6 has two values: the first indicates the number of datasets 
for which the method associated with its column is significantly more accurate 
than the method associated with its row; the second (in braces) indicates the 
number of datasets for which the method associated with its column produces 
significantly smaller rule sets than the method associated with its row. 

Table 4. Experimental results: number of rules produced by M5'Rules compared 
with number of rules produced by Cubist 

Dataset      Cubist         M5'R              M5'R              M5'R              M5'R
                            % RMS             MAE/Cover         CC×Cover          Cover
auto93       2.92±0.2       1.08±0.1 o        1.08±0.1 o        1.08±0.1 o        1.08±0.1 o
autoHorse    5.31±0.3       2.12±0.6 o        2.20±0.6 o        2.83±0.7 o        2.79±0.5 o
autoMpg      6.21±0.5       3.41±0.4 o        3.87±0.4 o        3.94±0.4 o        3.91±0.4 o
autoPrice    3.28±0.2       4.88±0.5 •        4.70±0.4 •        4.75±0.5 •        4.24±0.4 •
baskball     5.17±0.2       1.00±0.0 o        1.00±0.0 o        1.00±0.0 o        1.00±0.0 o
bodyfat      1.38±0.2       3.97±0.6 •        3.73±0.4 •        3.58±0.4 •        3.51±0.4 •
breastTumor  22.19±0.6      1.06±0.1 o        1.06±0.1 o        1.06±0.1 o        1.06±0.1 o
cholesterol  18.63±0.8      2.33±0.4 o        2.45±0.5 o        2.08±0.4 o        2.46±0.5 o
cleveland    8.27±0.8       1.07±0.1 o        1.06±0.1 o        1.16±0.3 o        1.18±0.3 o
cloud        1.09±0.1       2.62±0.5 •        2.55±0.4 •        2.60±0.4 •        2.63±0.4 •
cpu          2.00±0.0       2.74±0.2 •        2.72±0.2 •        2.70±0.2 •        2.71±0.2 •
echoMonths   6.24±0.4       1.00±0.0 o        1.00±0.0 o        1.00±0.0 o        1.00±0.0 o
elusage      2.00±0.0       1.62±0.2 o        1.62±0.2 o        1.62±0.2 o        1.62±0.2 o
fishcatch    2.00±0.0       3.47±0.3 •        2.87±0.4 •        3.63±0.3 •        3.63±0.3 •
housing      6.90±0.4       9.61±1.6 •        8.44±0.8 •        8.32±0.8 •        8.57±0.7 •
hungarian    9.15±0.5       1.56±0.2 o        1.56±0.2 o        1.65±0.2 o        1.69±0.3 o
lowbwt       6.45±0.2       1.05±0.1 o        1.05±0.1 o        1.05±0.1 o        1.04±0.1 o
mbagrade     3.57±0.1       1.00±0.0 o        1.00±0.0 o        1.00±0.0 o        1.00±0.0 o
meta         12.90±0.3      5.40±0.4 o        5.00±0.4 o        5.40±0.6 o        4.66±0.5 o
pbc          21.63±0.9      1.63±0.1 o        1.64±0.1 o        1.63±0.1 o        1.64±0.1 o
pharynx      7.96±0.1       2.07±0.5 o        2.20±0.5 o        2.16±0.4 o        2.18±0.4 o
pollution    1.53±0.2       1.22±0.1 o        1.19±0.1 o        1.20±0.1 o        1.21±0.1 o
pwLinear     2.00±0.0       2.00±0.0          2.00±0.0          2.00±0.0          2.00±0.0
quake        4.53±0.9       2.45±0.3 o        3.61±0.4          1.98±0.2 o        3.66±0.4
sensory      45.31±1.1      4.19±0.5 o        4.04±0.4 o        3.97±0.3 o        4.13±0.5 o
servo        5.77±0.1       5.05±0.3 o        5.20±0.4 o        4.18±0.3 o        4.09±0.3 o
sleep        2.17±0.2       1.00±0.0 o        1.00±0.0 o        1.00±0.0 o        1.00±0.0 o
strike       16.65±1.5      4.68±0.9 o        4.78±0.9 o        4.95±1.0 o        4.86±1.1 o
veteran      6.48±0.6       1.26±0.3 o        1.27±0.3 o        1.29±0.2 o        1.36±0.3 o
vineyard     2.77±0.1       2.27±0.2 o        2.07±0.2 o        2.18±0.2 o        2.07±0.2 o

o, • statistically significant improvement or degradation 

From the first row and the first column of Table 6 it can be noted that all four 
versions of M5'Rules, as well as (perhaps surprisingly) M5', outperform Cubist 
on more datasets than they are outperformed by Cubist. % RMS and CC × Cover 
are more accurate than Cubist on twelve datasets, Cover on eleven datasets and 
MAE / Cover on ten datasets. By comparison, Cubist does better than all four 
M5'Rules variants on nine datasets, eight of which are the same for all variants. 
When rule set sizes are compared, it can be seen that M5'Rules produces smaller 
rule sets than Cubist more often than not. % RMS and CC × Cover produce 
smaller rule sets than Cubist on twenty-three datasets, and MAE / Cover and 
Cover produce smaller ones on twenty-two datasets. Cubist, on the other hand, 
produces smaller rule sets than all variants of M5'Rules on only six datasets. 
From Table 4, it can be seen that in many cases M5'Rules produces far fewer 
rules than Cubist. For example, on the sensory dataset Cubist produces just 
over forty-five rules, while M5'Rules is more accurate with approximately four 
rules. Furthermore, from Table 5 it can be seen that M5'Rules generates fewer 
conditions per rule set than Cubist: it is significantly better on twenty-four 
datasets and worse on at most five. For some datasets (sensory, pbc, 
breastTumor, cholesterol) the differences are dramatic. 

5 Conclusion 



We have presented an algorithm for generating rules for numeric prediction by 
applying the separate-and-conquer technique to generate a sequence of model 







Table 5. Experimental results: average number of conditions per rule set 
produced by M5'Rules compared with average number of conditions per rule set 
produced by Cubist 



Dataset 


Cubist 


M5'R 
% RMS 


M5'R 

MAE/Cover 


M5'R 
CC X Cover 


M5'R 

Cover 




auto93 


4.37±0.6 


0.08±0.1 o 


0.08i0.1 0 


0.09i0.1 o 


0.09i0.1 


0 


autoHorse 


11.60±1.1 


4.47±2.2 o 


3.89il.5 0 


3.96i2.2 o 


3.63il.3 


o 


autoMDE 


14.13±1.6 


5.17±1.1 o 


3.50i0.4 o 


3.53i0.5 o 


3.49i0.5 


0 


autoPrice 


5.63±0.6 


8.61±1.7 • 


7.64il.4 • 


7.19il.0 • 


5.98i0.7 




baskball 


11.75±0.7 


O.OOiO.O o 


O.OOiO.O o 


O.OOiO.O 0 


O.OOiO.O 


0 


bodyfat 


0.76±0.5 


6.86±1.7 • 


4.07i0.8 • 


3.00i0.6 • 


3.08i0.6 


• 


breastTumor 


81.32±3.9 


0.06±0.1 o 


O.OeiO.l o 


o.oeio.i o 


O.OSiO.l 


0 


cholesterol 


81.86±4.7 


1.78±0.8 o 


1.48i0.6 o 


1.58i0.6 o 


1.47i0.5 


0 


Cleveland 


27.37±4.3 


0.19±0.3 o 


0.15i0.3 0 


0.24i0.4 o 


0.21i0.4 


0 


cloud 


0.18±0.1 


2.23±0.7 • 


1.59i0.4 • 


1.79i0.6 • 


1.76i0.6 


• 


cpu 


2.00±0.0 


1.91±0.2 


1.86i0.3 


1.82i0.2 


1.82i0.2 




echoMonths 


16.53±1.7 


O.OOiO.O o 


O.OOiO.O o 


O.OOiO.O o 


O.OOiO.O 


0 


elusage 


2.00±0.0 


0.64±0.2 o 


0.63i0.2 o 


0.62i0.2 o 


0.62i0.2 


0 


fishcatch 


2.00±0.0 


4.43il.l • 


3.66i0.9 • 


3.40i0.6 • 


3.40i0.6 


• 


housing 


18.28±1.6 


30.44±6.9 • 


18.28i2.7 


16.30i2.6 


15.73i2.1 




hungarian 


33.04±3.3 


0.88i0.3 o 


0.79i0.3 0 


0.81i0.3 o 


0.84i0.4 


o 


lowbwt 


21.19±0.9 


O.OSiO.l o 


O.OSiO.l 0 


O.OOiO.l o 


O.OSiO.l 


0 


mbagrade 


6.61±0.4 


O.OOiO.O o 


O.OOiO.O o 


O.OOiO.O o 


O.OOiO.O 


o 


meta 


25.87±0.6 


8.81il.O o 


7.34i0.7 o 


10.53il.8 o 


6.49i0.9 


0 


pbc 


100.86±5.6 


o.esio.i o 


O.eSiO.l 0 


o.esio.i 0 


O.SSiO.l 


o 


pharynx 


7.96±0.1 


2.70il.5 o 


1.86i0.9 0 


i.eoio.e 0 


1.79i0.7 


0 


pollution 


1.18±0.5 


0.28i0.2 o 


0.25i0.1 o 


0.25i0.1 0 


0.25i0.1 


0 


pwLinear 


2.00±0.0 


l.OOiO.O 0 


l.OOiO.O 0 


l.OOiO.O 0 


l.OOiO.O 


0 


quake 


9.64±3.0 


3.04i0.5 o 


2.66i0.4 0 


3.62i0.9 0 


2.S9i0.5 


0 


sensory 


218.80±6.8 


8.88il.6 0 


S.43il.l 0 


5.70i0.9 0 


5.40il.2 


0 


servo 


13.33±0.2 


7.42i0.9 o 


7.15i0.9 o 


5.19i0.6 0 


5.05i0.5 


0 


sleep 


2.38±0.8 


O.OOiO.O o 


O.OOiO.O 0 


O.OOiO.O 0 


O.OOiO.O 


0 


strike 


46.55±5.2 


9.27i2.9 o 


4.98il.2 0 


7.40i2.1 0 


5.25il.S 


0 


veteran 


19.63±2.5 


O.SOiO.S o 


0.42i0.3 o 


0.41i0.3 0 


0.41i0.4 


0 


vineyard 


4.13±0.4 


1.90i0.2 o 


1.56i0.2 0 


1.53i0.2 0 


1.5Si0.2 


0 



o,« statistically significant improvement or degradation 



trees, reading one rule off each of the trees. The algorithm is straightforward
to implement and relatively insensitive to the heuristic used to select competing
rules from the tree at each iteration. M5'Rules using the coverage heuristic is
significantly worse, in terms of accuracy, on only one (autoPrice) of the thirty
benchmark datasets when compared with M5'. In terms of compactness, M5'Rules
never produces larger rule sets and produces smaller sets on ten datasets. When
compared to the commercial system Cubist, M5'Rules outperforms it on size and
is comparable on accuracy. When based on the number of leaves it is more than
three times more likely to produce significantly fewer rules. When the number of
conditions per rule is used to estimate rule size, M5'Rules is eight times more
likely to produce rules with fewer conditions than Cubist.

Published results with smoothed trees [15] indicate that the smoothing
procedure substantially increases the accuracy of predictions. Smoothing cannot be
applied to rules in the same way as for trees because the tree containing the
relevant adjacent models is discarded at each iteration of the rule generation
process. It seems more likely that improvements to M5'Rules will have to be
made as a post-processing optimization stage. This is unfortunate because
generation of accurate rule sets without global optimization is a compelling aspect
of the basic PART procedure, on which M5'Rules is based. However, smoothing
usually increases the complexity of the linear models at the leaf nodes, making
the resulting predictor more difficult to analyze.

Table 6. Results of paired t-tests (p = 0.01): number indicates how often method
in column significantly outperforms method in row; number in braces indicates
how often method in column produces significantly fewer rules than method in
row.

              Cubist    M5'       % RMS     MAE / Cover   CC × Cover   Cover
Cubist        -         12 {20}   12 {23}   10 {22}       12 {23}      11 {22}
M5'           8 {6}     -         0 {12}    0 {11}        0 {11}       0 {10}
% RMS         9 {6}     2 {1}     -         1 {2}         0 {4}        1 {5}
MAE / Cover   9 {6}     2 {1}     1 {2}     -             1 {4}        1 {2}
CC × Cover    9 {6}     2 {0}     1 {2}     1 {1}         -            1 {2}
Cover         9 {6}     1 {0}     0 {3}     0 {1}         0 {2}        -
Abstract. This paper presents a method to classify and learn cricket
shots. The procedure begins by extracting the camera motion parameters
from the shots. Then the camera parameter values are converted
to symbolic form and combined to generate a symbolic description that
defines the trajectory of the cricket ball. The description generated is
used to classify the cricket shot and to dynamically expand or update
the system’s knowledge of shots. The first novel aspect of this approach
is that by using the camera motion parameters, a complex and difficult
process of low level image segmenting of either the batsman or the cricket
ball from video images is avoided. Also the method does not require high
resolution images. Another novel aspect of this work is the use of a new
incremental learning algorithm that enables the system to improve and
update its knowledge base. Unlike previously developed algorithms which
store training instances and have simple methods to prune their concept
hierarchies, the incremental learning algorithm used in this work generates
compact concept hierarchies and uses evidence based forgetting.
The results show that the system performs well in the task of classifying
four types of cricket shots.



1 Introduction 

In this paper we present a system which uses camera motion to recognise and 
learn cricket shots. The work is part of a multimedia project which aims to use 
transcripts of commentary and image clues to recognise, learn and produce a 
natural language description of action/s taking place in the video segment. The 
domain of our work is sports and we have previously developed a system that is 
able to recognise and learn American Football plays based on the transcript of the 
commentary and video clues [7], [6], [8]. Essentially the transcript clues are used 
to constrain the search for a match for the play in the video segment while the 
video clues are used to refine the solution. The system develops detailed spatio- 
temporal models of the American Football plays which can be easily converted 
into detailed text descriptions of them. This offers the advantage that the system 
can generate descriptions for the plays which are far more detailed than that 
present in normal commentary. 

In the current work we explore the possibility of applying the work done in 
American Football to other sports, in this case, cricket. The reason for choosing 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 13-23, 1999. 

(c) Springer-Verlag Berlin Heidelberg 1999 
cricket is that though cricket is a lot simpler than American Football, it still
involves well defined actions (cricket shots) for which accurate representations
can be built. We define a cricket shot as the way in which the batsman hits the 
ball he is facing and propose a classification scheme based on the direction in 
which the ball is hit and the distance covered by the ball. 

The objective of our research is to combine camera motion estimation with 
incremental machine learning to classify and learn types of cricket shots. The 
video processing is fast, does not require high resolution images and avoids com- 
plex low level segmentation. An important aspect of this work is the use of a new 
incremental learning algorithm. The incremental learning involves updating ex- 
isting symbolic descriptions of the cricket shots in order to keep them consistent 
with incoming data. 

The paper is organised as follows. Section two presents some background 
information on camera motion estimation and incremental machine learning. 
Section three covers the way in which we extract the data from video and process 
it. Section four presents the results and section five contains the conclusions. 



2 Previous Work 

The problem of estimating motion parameters has been researched extensively 
in the past since it provides a simple, fast and accurate way to search multimedia 
databases for specific shots (for example a shot of a landscape is likely to involve 
a significant amount of pan, whilst a shot of an aerobatic sequence is likely to 
contain roll). 

The method of Bergen et al. [4] is based on two models: a global model that
constrains the overall motion estimated and a local model that is used in the
estimation process. Affine flow, planar flow, rigid body motion and general optic
flow are the four specific models chosen. The same objective function is used in all
models and the minimisation is performed with respect to different parameters.

Akutsu et al. [1] have proposed a method based on analysing the distribu- 
tion of motion vectors in Hough space. Seven categories of camera motion are 
estimated: pan, tilt, zoom, pan and tilt, pan and zoom, tilt and zoom, and pan, 
tilt and zoom. Each category has a different signature curve in the Hough space. 
Estimation of the motion parameters is based on Hough-transformed optic-flow 
vectors measured from the image sequence and determining which of the signa- 
tures best matches the data in a least squares sense. 

Park et al. [11] describe a method of estimating camera parameters that es- 
tablishes feature based correspondence between frames. The camera parameters 
representing zoom, focal length and 3D rotation are estimated by fitting the 
correspondence data to a transformation model based on perspective projection. 

Tse and Baker [13] present an algorithm to compensate for camera zoom 
and pan. The global motion in each frame is modelled by just two parameters: a 
zoom factor and a pan and tilt factor based on local displacement vectors found 
by conventional means. 
Wu and Kittler [14] present a technique to extract rotation, change of scale,
and translation from an image sequence without establishing correspondence. A 
multi-resolution iterative algorithm that uses a Taylor series approximation of 
the spatial and temporal gradient of the image is used. 

To extract the camera motion parameters from a sequence of images we use
a method developed by Srinivasan et al. [10]. The method can qualitatively
estimate camera pan, tilt, zoom, roll, and horizontal and vertical tracking. Unlike
most other comparable techniques, this method can distinguish pan from
horizontal tracking, and tilt from vertical tracking.

Several methods of learning have been developed but the one that is best 
suited for real world situations is incremental learning. Human beings learn 
incrementally because facts are generally received in the form of a sequential 
flow of information. Typically the information received comes in steps and human 
beings have to learn how to deal with a situation long before all the facts are 
available. Further, humans have limited memory and processing power [12]. 

There are several important issues in incremental learning which have been
identified: bias, concept drift, memory size and forgetting. Several systems have
been designed (ILF [7], [5], GEM [12], COBWEB [3], [2] and UNIMEM [9]) to 
address some of the issues mentioned above. 

We use the incremental learning algorithm ILF [7,5]. The concepts developed 
by our algorithm are stored in a hierarchy in which all descriptions share all the 
features observed in the training instances. The descriptions in our structure do 
not store the individual instances of the cricket shots and any feature’s range of 
values is defined with the help of a set which covers all the values encountered 
in the training instances which were used in the generalisation process. There 
are several reasons for choosing this type of representation. The reason for not 
storing the instances is that it allows the system to detect any drift in the 
target concept and also by using all observed features to build the concepts, 
the system is able to handle cases of missing or noisy data. Furthermore this 
representation substantially reduces the size of the concept and as a result the 
amount of memory required to store the hierarchy. 

3 Extracting and Converting the Camera Motion Parameters

There are two ways in which a cricket shot can be determined. One method
involves segmenting out either the batsman or the ball. Unlike American Football
where it is possible to track some of the players in the play, in cricket it is much
more difficult to track the batsman for two reasons: the high speed of
the shot (few frames, too much blur as shown in Figure 1) and the fact that the
batsman is often occluded by the wicket keeper. Furthermore, even if the batsman
could be consistently segmented out from the image, it is still difficult to
distinguish the action of the batsman (the bat cannot be identified consistently so
its pattern of movement cannot be accurately classified). It is similarly difficult
to segment out the cricket ball in video especially since the cricket ball is small
and difficult to distinguish from the background (generally there is no significant
difference between the ball and background). An alternative method is to use the
camera motion parameters to determine the path of the ball. Throughout the cricket
game the camera generally focuses on the ball trajectory and hence it is possible
to generate a hypothesis on the type of cricket shot based on the camera motion
parameters which in turn define the direction of the ball. There are three stages
in the processing of the camera motion parameters.

Fig. 1. Two typical sequences of a batsman attempting a shot. Images provided
with the courtesy of Wide World of Sports - Channel 9 Australia.

In the first stage the system attempts to determine the size of the window
that contains the cricket shot, that is the start and end of the shot sequence
containing the shot (the camera position is assumed to be behind one of the two
bowling ends of the cricket ground to capture the bowling action). The cricket
action consists of two parts: the bowler action and the batsman action. The
bowler’s action is defined by a sequence of frames in which there is a substantial
amount of zoom and tilt but little pan. This is because the camera is tracking
and zooming on the bowler and the cricket ball. Both the bowler and the ball
move fairly straight and hence there is no substantial panning. The batsman
action is defined by a sudden change in the direction of the ball which involves
a significant amount of pan. The cricket shot ends with a cut when the camera
focuses on the crowd, the ground or a field player who has fielded the cricket
ball. The cricket shot window therefore starts at the frame where the system
has detected a sudden change in the camera parameters and ends at the frame
where the system detects a cut.
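The window selection just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame pan estimates and cut flags are assumed to come from earlier processing, the "sudden change" is approximated by the pan magnitude crossing a hypothetical threshold, and the names shot_window and pan_threshold are ours.

```python
def shot_window(pan, is_cut, pan_threshold=5.0):
    """Return (start, end) frame indices of the batsman's shot, or None.

    pan           -- per-frame pan estimates (assumed input)
    is_cut        -- per-frame booleans from a cut detector (assumed input)
    pan_threshold -- hypothetical magnitude that counts as a sudden pan
    """
    # Start: first frame whose pan magnitude signals a sudden change.
    start = next((i for i, p in enumerate(pan) if abs(p) >= pan_threshold), None)
    if start is None:
        return None
    # End: first cut after the start frame.
    end = next((i for i in range(start + 1, len(pan)) if is_cut[i]), None)
    if end is None:
        return None
    return start, end


# Made-up sequence: little pan during the bowling action, then a strong pan.
pan = [0.2, 0.1, 0.3, 6.0, 7.5, 8.0, 0.0]
is_cut = [False, False, False, False, False, False, True]
print(shot_window(pan, is_cut))  # (3, 6)
```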

The second stage involves determining the dominant motion of the camera 
in the cricket shot. The reason for checking for a dominant motion is that the 
movement of the camera during a shot is not always smooth and contains varying 
amounts of zoom and tilt as it all depends on how good the camera man is at 
tracking the ball: the more experienced, the smoother the action. Hence it is quite 
likely that a drive on the left side will contain some small movement to the right 
which occurs while the camera man attempted to track the ball. Such movement 
is essentially noisy and must be eliminated to be able to determine the real 
camera movement. The system analyses the entire sequence and determines the 
dominant movement of the camera (for example whether the overall movement 
was to the left or to the right) by computing the frequency of the negative and 
positive values for the camera parameters. The most frequent sign determines 
the dominant motion for that category of camera parameters. For example if the 
values for the pan parameter are mostly positive then the dominant motion is to 
the right otherwise the movement is to the left. The symbolic values classifying 
the dominant motion for all camera parameters are shown in Table 1 (the actual
pan/tilt/roll value is not considered at this stage when converting it to symbolic
form, just the sign).



Table 1. Symbolic values classifying the dominant camera motion.

Camera Parameter   Symbolic Value   Value Sign
Pan                Right            Positive
Pan                Left             Negative
Tilt               Up               Positive
Tilt               Down             Negative
Roll               Clockwise        Positive
Roll               Anticlockwise    Negative
Zoom               Zoom-In          Positive
Zoom               Zoom-Out         Negative
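The sign-counting step of the second stage can be illustrated with a short sketch. It assumes per-frame parameter values are already available; the function name dominant_motion and the sample values are ours, and a tie is resolved towards the negative label because the text says "mostly positive ... otherwise".

```python
# Table 1 mapping: (label for positive values, label for negative values).
SYMBOLS = {
    "pan": ("Right", "Left"),
    "tilt": ("Up", "Down"),
    "roll": ("Clockwise", "Anticlockwise"),
    "zoom": ("Zoom-In", "Zoom-Out"),
}


def dominant_motion(parameter, values):
    """Return the symbolic dominant motion for one camera parameter.

    Only the sign of each per-frame value is used, not its magnitude.
    """
    positive_label, negative_label = SYMBOLS[parameter]
    positives = sum(1 for v in values if v > 0)
    negatives = sum(1 for v in values if v < 0)
    return positive_label if positives > negatives else negative_label


# Made-up pan values: mostly negative, with two small corrections to the right.
pan_per_frame = [-2.0, -3.5, 0.4, -1.2, -0.8, 0.3]
print(dominant_motion("pan", pan_per_frame))  # Left
```

The two small positive values play the role of the noisy corrective movements mentioned above; they are outvoted by the four negative frames.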



The third stage involves a more refined classification of the camera motion.
Once the dominant motion has been identified much of the noise is removed.
Then for each camera parameter in turn, the system collects all the values from
the sequence and computes an average. This average value indicates how far and
how fast the camera moved during the cricket shot. The average value is also
converted into a symbolic value (the threshold values defining the boundaries of
the types of camera motion were obtained using a simple clustering technique;
generally the values fall into three intervals, for example for the Gabba Cricket
Ground the intervals were [-30..-1], [-0.1..0.1], [1..30]). Table 2 shows the symbolic
values used to describe the average camera parameter values.

Table 2. Symbolic values for the average camera motion parameter value.

Camera Parameter   Symbolic Value   Average Value
Pan                Right            High Positive Values
Pan                Left             High Negative Values
Pan                Middle           Close To Zero
Tilt               Up               High Positive Values
Tilt               Down             High Negative Values
Tilt               Centered         Close To Zero
Roll               Clockwise        High Positive Values
Roll               Anticlockwise    High Negative Values
Roll               Static           Close To Zero
Zoom               Zoom-In          High Positive Values
Zoom               Zoom-Out         High Negative Values
Zoom               Steady           Close To Zero
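As an illustration of the third stage, the sketch below averages the per-frame values of one parameter and maps the result to a Table 2 label using the Gabba intervals quoted above. It is a sketch under assumptions: the function name average_symbol is ours, and returning None for an average that falls between the quoted intervals is our choice, since the text does not say how such values are handled.

```python
# Intervals quoted in the text for the Gabba ground: (low, high, kind).
GABBA_INTERVALS = [
    (-30.0, -1.0, "negative"),
    (-0.1, 0.1, "zero"),
    (1.0, 30.0, "positive"),
]

# Table 2 labels for each parameter, keyed by interval kind.
LABELS = {
    "pan": {"positive": "Right", "negative": "Left", "zero": "Middle"},
    "tilt": {"positive": "Up", "negative": "Down", "zero": "Centered"},
    "roll": {"positive": "Clockwise", "negative": "Anticlockwise", "zero": "Static"},
    "zoom": {"positive": "Zoom-In", "negative": "Zoom-Out", "zero": "Steady"},
}


def average_symbol(parameter, values, intervals=GABBA_INTERVALS):
    """Map the mean of the per-frame values to a symbolic label."""
    mean = sum(values) / len(values)
    for low, high, kind in intervals:
        if low <= mean <= high:
            return LABELS[parameter][kind]
    return None  # assumption: averages outside every interval stay unclassified


print(average_symbol("tilt", [-4.0, -6.0, -5.0]))   # Down
print(average_symbol("pan", [0.05, -0.02, 0.03]))   # Middle
```

Because the intervals are ground specific, a different ground would supply its own list through the intervals argument.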

Once symbolic values for both the dominant motion and the average camera 
motion have been obtained for each of the four camera parameters, the system 
can identify the type of the cricket shot. Each shot can, therefore, be expressed 
as a sequence of symbolic values. Figure 2 shows types of cricket shot which the
system is attempting to identify. Besides the eight symbolic values representing
the camera parameters, the system also generates a ninth symbolic value which 
defines the length of the shot. The length of the shot is simply derived from 
the duration of the video shot sequence (the number of frames in a shot). For 
example consider the shot long straight drive (on the left side). One possible set 
of symbolic values for the camera parameters is shown in Table 3. 



Table 3. Symbolic values classifying the dominant camera motion and average
motion for a long straight drive (on the left side).

Camera Parameter   Dominant Motion   Average Value
Pan                Left              Close To Zero
Tilt               Down              High Negative Value
Roll               Clockwise         High Negative Value
Zoom               Zoom-Out          High Negative Value



Fig. 2. The six cricket shots the system attempts to identify: the drive shot (left,
right or straight), the pull shot (left or right) and the hook shot.

Extracting camera parameters for the frames that make up a window containing
the cricket shot provides information about the ball trajectory. To determine
its type, we classify the shot based on the three parameters: dominant motion,
average camera motion value and length. However the values of two of the
parameters along which a shot is classified, namely dominant motion and average
camera motion value, vary from game to game. This is due to the following
factors:

— Camera position varies on different cricket grounds mainly in the angle at 
which the shot is captured (right behind wicket or at a slight angle) and 
the perspective from which the shot is taken (low down or high up in the 
stands). 

— Ground shape and size also varies on different cricket grounds (for example 
the Brisbane Gabba Cricket Ground is smaller than the Melbourne Cricket 
Ground and a shot such as a drive involves slightly different camera motion 
— different pan and zoom). 

The goal of our work has been to build general cricket shot descriptions to
enable the system to classify shots from different cricket grounds and therefore
any ground specific information needs to be removed. To deal with variations in the
parameters we use incremental learning. We build representations of the shots
using symbolic data and update these representations when necessary.

4 Classifying and Learning Cricket Shots

Training data is extracted from a video sequence, such that several shots are
extracted and their descriptions are derived in terms of symbolic movement data,
symbolic average value and shot length. We use the incremental learning
algorithm ILF [7], [5].

The way in which the system learns from the incoming cricket shot descrip- 
tions is as follows. The symbolic descriptions generated by the data analysis 
module are passed on to the incremental learning module which first attempts 
to find a match for the shot in the existing hierarchy of shots. Each new shot 
description is compared with the current description in the hierarchy to deter- 
mine if there is enough evidence to justify the update of the current description. 
Each description in the hierarchy produces an evidence score which determines 
whether the shot does or does not match the current description. The score is 
computed as a function of age, where age records the duration in time the shot 
is known to ILF. 

The general format of the cricket shot type description is shown in Figure 3.
This shows that there are n shots, with shot 3 having m examples in its
description. In this way multiple descriptions can be updated (provided enough
evidence was found) by the same shot. While this procedure results in the
system updating descriptions which should not be updated when one considers the
overall set of shots, results show that over time the unnecessary modifications
are “aged out” of the descriptions. That is, unless a particular shot description
gets reinforced by other similar ones, it is forgotten. The main reason for choosing
to update multiple descriptions is that it is a simple way of representing a fuzzy
match between the existing description and the input cricket shot which is more
appropriate than an absolute match (as we mentioned above, the camera
parameters vary from game to game).

Fig. 3. A typical description hierarchy used by the incremental learning
algorithm. Each shot description has an age and 9 attributes and each one of the
attributes has a set of data values + age values associated with it.

The update process is based on data ageing. The algorithm uses ageing at two
levels: the data level and the description level. Therefore, when a new cricket
shot is processed there are three possible outcomes. The first is that the system
finds a match for it in the existing hierarchy of shot descriptions. The
description that matches the input cricket shot is updated to be consistent with
the new data. The second outcome is that the system does not find a match for the
new cricket shot so a new description is added to the hierarchy while updating
the existing shot descriptions. The third outcome from processing a new cricket
shot is that one (or possibly more) descriptions in the hierarchy are removed
since the data is “aged out”.
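The three outcomes can be illustrated with a toy model of ageing at the data level and the description level. This is only a sketch and not the ILF algorithm: the evidence score is replaced by a simple count of agreeing attributes, the constants MAX_AGE and MIN_MATCHES are hypothetical, and three attributes stand in for the nine used by the system.

```python
MAX_AGE = 3      # hypothetical: shots a description may go unreinforced
MIN_MATCHES = 2  # hypothetical: attributes that must agree for a match


def process_shot(hierarchy, shot):
    """Update a list of shot descriptions in place with one symbolic shot.

    A description is {"age": int, "values": {attribute: {value: age}}}.
    """
    # Ageing at both levels: descriptions and stored values all get older.
    for desc in hierarchy:
        desc["age"] += 1
        for ages in desc["values"].values():
            for value in ages:
                ages[value] += 1

    # Outcome 1: every description that agrees on enough attributes is updated.
    matched = False
    for desc in hierarchy:
        agree = sum(1 for attr, value in shot.items()
                    if value in desc["values"].get(attr, {}))
        if agree >= MIN_MATCHES:
            matched = True
            desc["age"] = 0
            for attr, value in shot.items():
                desc["values"].setdefault(attr, {})[value] = 0

    # Outcome 2: no match, so a new description is added.
    if not matched:
        hierarchy.append({"age": 0,
                          "values": {a: {v: 0} for a, v in shot.items()}})

    # Outcome 3: stale values and stale descriptions are forgotten.
    for desc in hierarchy:
        for attr, ages in desc["values"].items():
            desc["values"][attr] = {v: a for v, a in ages.items() if a <= MAX_AGE}
    hierarchy[:] = [d for d in hierarchy if d["age"] <= MAX_AGE]
    return hierarchy


drive_left = {"pan": "Left", "tilt": "Down", "length": "Long"}
pull_right = {"pan": "Right", "tilt": "Up", "length": "Short"}
hierarchy = []
process_shot(hierarchy, drive_left)
for _ in range(4):
    process_shot(hierarchy, pull_right)
print(len(hierarchy))  # 1: the unreinforced drive description was aged out
```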



5 Results 

We have trained the system on 63 cricket shots and tested it on 80 shots. The 
video segments used in our work had a length varying from 24 to 190 frames. 
The shots have been collected from 5 cricket games and cover five types of shots. 
Two of the shots occur rarely (most players tend to avoid them since they are 
high risk shots) and as a result very few instances were present in either the 
training or the test sets. The training data is shown in Table 4 and the test data 
is shown in Table 5. The results of the classification are shown in Table 6. The 
system performed very well when classifying four types of shots: the pull shot 
(left and right) and drive shot (left and right). The overall success rate averaged 
at 77% (slightly higher on right hand side — this was due to the fact that more 
data were available to train the system on the right hand side). The results 
also show that the system was not able to build an accurate description for the 
straight drive shot and hence it was unable to accurately classify the shot. In 
the case of the straight drive shot the system has simply been unable to identify 
any significant differences between the camera parameters for a straight drive 
and a drive on each side of the ground (the pan values do not vary as much as 
expected). 



Table 4. The training data.

Number of Shots   Side    Shot Type
25                Right   Drive
15                Left    Drive
18                Right   Pull
14                Left    Pull
Table 5. The test data.

Number of Shots   Side    Shot Type
24                Right   Drive
(illegible)       Left    Drive
(illegible)       Right   Pull
18                Left    Pull



Table 6. The results of the tests.

Side    Shot Type   Correct   Incorrect
Right   Drive       19        5
Left    Drive       18        2
Right   Pull        15        5
Left    Pull        12        6
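As a worked check of the averaged success rate, the per-type rates can be recomputed from the Correct and Incorrect counts printed in Table 6. The sketch is illustrative only: the variable names are ours, rows are taken in the order printed, and the figure computed is the unweighted mean over the four shot types.

```python
# (shot type, correct, incorrect) in the row order of Table 6.
results = [("Drive", 19, 5), ("Drive", 18, 2), ("Pull", 15, 5), ("Pull", 12, 6)]

# Success rate for each row, then the unweighted mean over the four rows.
rates = [correct / (correct + incorrect) for _, correct, incorrect in results]
mean_rate = sum(rates) / len(rates)

print([round(r, 3) for r in rates])  # [0.792, 0.9, 0.75, 0.667]
print(round(mean_rate, 3))           # 0.777
```

The unweighted mean of about 0.777 is in line with the 77% average quoted in the text.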



6 Conclusions 

We have developed a system which classifies and learns cricket shots. The system 
uses the camera motion parameters to estimate the direction of the ball in a 
cricket shot. The procedure to extract the camera parameters is fast, robust and 
avoids complex and costly low level image segmentation. The camera motion 
parameters are converted into a symbolic description of the camera movement 
which defines the trajectory of the ball. The symbolic description is then used to 
classify and learn cricket shots. The learning process uses forgetting and allows 
the system to keep existing cricket shot descriptions consistent with incoming 
data. The results show that the system performs well in the task of classifying
four types of cricket shot with an average success rate of 77%.
Abstract. The Metagame approach to computer game playing, introduced
by Pell, involves writing programs that can play many games from
some large class, rather than programs specialised to play just a single
game such as chess. Metagame programs take the rules of a randomly
generated game as input, then do some analysis of that game, and then
play the game against an opponent. Success in Metagame competitions
is evidence of a more general kind of ability than that possessed by (for
example) a chess program or a draughts program. In this paper, we take
up one of Pell’s challenges by building a Metagame player that can learn.
The learning techniques used are a refinement of the regression methods
of Christensen and Korf, and they are applied to unsupervised learning,
from self-play, of the weights of the components (or advisors) of the
evaluation function. The method used leads to significant improvement
in playing strength for many (but not all) games in the class. We also
shed light on some curious behaviour of some advisor weights. In order
to conduct this research, a new and more efficient Metagame player was
written.



1 Introduction 

Metagame is a relatively new field of Artificial Intelligence research originated by 
Pell [8,9,10,11,12]. The basic idea is to develop and compare programs which can 
analyse and play any game from some general class, rather than just a single 
game. A Metagame program takes the rules (suitably encoded) to some new 
game (possibly randomly generated), performs some analysis of the game, then 
plays that game against other players. The performance of such programs can 
be compared quantitatively using the results of games and tournaments. The 
hope is that success (by a program) at Metagame is indicative of a more general 
kind of problem solving ability than that possessed by a program which plays a 
single game such as Chess. Analysis of particular games in the class is intended 
to be done by the program rather than a human. Metagame is thus intended as 
a testbed for AI ideas, and in many respects may be a better one than Chess. 

Pell introduced the class SCL of Symmetric Chess-Like games as a domain for 
Metagame play. This class consists of games between two players on a rectangular 
board, with pieces, moves, captures, promotion, goals and other ingredients. For 
full details of the SCL class, see [11]; we give a little more detail in §3. 



N. Foo (Ed.): AI’99, LNAI 1747, pp. 24-35, 1999. 

Pell developed a Metagame player (in Prolog) called METAGAMER for the 
SCL class, and ran some tournaments with several Metagame-playing algo- 
rithms as participants (playing several randomly generated games from the 
class). The tournament participants themselves were fairly elementary (being 
just bit-players in Pell’s work) but the work did serve to demonstrate the feasi- 
bility of the Metagame paradigm. Pell pointed to many useful avenues for further 
research, including the use of Metagame as a testbed for learning. 

The purpose of this paper is to take up that challenge, in particular to apply 
some unsupervised learning techniques to Metagame. It is natural for the learn- 
ing to be unsupervised because of the lack of prior human expertise on the SCL 
class as a whole. We construct what may well be the world’s best Metagame 
player (albeit for a slight restriction of the SCL class). It is able to learn well 
enough to significantly improve play over much of the class considered. We find, 
though, that learning well enough to improve play at every single game in (even 
our restriction of) the whole SCL class is beyond our methods. We will discuss 
some of the problems faced and the lessons learned. 

The game playing algorithms considered here are of a very standard type:
fixed depth α-β search, with the positions at the leaves of this search tree being
valued according to an evaluation function. The evaluation function is a linear
combination of advisors, where an advisor is a function of the position which
returns something helpful like material balance or mobility. Advisors in Metagame
capture concepts which apply across (at least a large portion of) the SCL class.
Advisors are symmetric in the sense that interchanging colours in a position will
simply negate an advisor’s value. Further information on the actual advisors
used is given in the next Section. The coefficients of the advisors are the weights.
If the choice of advisors and search depth are given, then playing strategy is
entirely determined by these weights. In this paper we study a method of learning
the weights.

The learning techniques used are a refinement of the regression methods 
of Christensen and Korf [2]. Their method involved starting with some initial 
estimates of the weights in the static evaluation function and solving a series 
of linear regression problems to obtain a series of successively improved sets 
of weights. A variety of initial estimates may be tried in order to increase the 
chance of finding a good local optimum. They apply their method to learning 
weights of pieces in Chess, obtaining interesting results which they discuss. Here 
the technique is adapted to learn weights of evaluation function components in 
Metagame. This is apparently more difficult and we describe the refinements 
we found it necessary to introduce. We also shed light on correlations between 
advisors, the effect this has on the advisor weights learned, and the import of 
negative weights (which turn out not to be a problem after all). 

The experiments conducted required extensive computations. Pell’s 
Metagame player (in Prolog), while having many nice features, was too slow 
for this purpose. One of us (Powell) wrote a new Metagame player, for most of 
the SCL class, in C, which was fast enough. 

We briefly describe some other relevant work. Epstein [4] constructed a
program, Hoyle, which also plays games from some class, but does no analysis of the
rules of the game. It does some supervised learning; for each game, it requires 
a game-specific program written by a human expert from which to learn. More 
recently, Epstein, Gelfand and Lesniak [5] have added a pattern-based learning 
capability to Hoyle. The program Morph II, due to Levinson [6,7], can play a very
general class of games defined in predicate logic, and a couple of learning methods 
have been implemented. Samuel [13] did pioneering work on the unsupervised 
learning of evaluation function weights (applied to Draughts), although he used 
an iterative updating method different to the regression technique employed here 
and in [2]. Abramson [1] applied regression to learning evaluation functions that
estimate the expected outcome (under random play) of a game. The technique 
is widely applicable, and he tried it out on Othello and Chess with interesting
results. Pell [11,12] has suggested applying to Metagame the Temporal Difference
(TD) methods of Sutton [15], which Tesauro [16,17] applied with notable success
to Backgammon. One practical difficulty for us, if such methods were to be
applied to Metagame, is the much greater number of games required.

In the remainder of this paper we describe the learning methods we use, 
report the experimental results obtained, discuss them, draw some conclusions 
and offer some suggestions for future work. 



2 Methods 

As usual (e.g. [14]), an evaluation function $v_0$ is a linear combination of
components $a_j$, $j = 1, \ldots, n$ (which, in the context of Metagame, Pell [11]
calls advisors). Both $v_0$ and its advisors are functions of the position $P$.
The coefficients (or weights) of the advisors are denoted by $\lambda_j$,
$j = 1, \ldots, n$, and are independent of position. It is often convenient to
speak of the weight vector $\lambda = (\lambda_j)_{j=1}^{n}$. Thus:

$$v_0(\lambda, P) = \sum_{j=1}^{n} \lambda_j \, a_j(P).$$

This is assumed to be some sort of estimate of the value of the position $P$. It
is assumed that a better estimate of the “true” value of $P$ can be obtained by
searching to depth $d$ and backing up the static evaluations given by $v_0$ at the
leaves of this search tree; denote this depth-$d$ evaluation of $P$ by
$v_d(\lambda, P)$. These
assumptions are standard, although not rigorously justified (see [1]).
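
The two evaluations just defined can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the `advisors` and `successors` callables, the toy game tree and the convention that advisor values are always reported from the first player's point of view are assumptions made purely for illustration, and the special value assigned to won positions is omitted.

```python
from typing import Callable, Hashable, Sequence


def static_eval(weights: Sequence[float], advisor_values: Sequence[float]) -> float:
    """v_0(lambda, P): weighted sum of the advisor values of one position."""
    return sum(w * a for w, a in zip(weights, advisor_values))


def backed_up_eval(pos: Hashable, depth: int, weights: Sequence[float],
                   advisors: Callable, successors: Callable,
                   maximizing: bool = True,
                   alpha: float = float("-inf"),
                   beta: float = float("inf")) -> float:
    """v_d(lambda, P): fixed-depth alpha-beta value with v_0 applied at the leaves."""
    children = successors(pos)
    if depth == 0 or not children:
        return static_eval(weights, advisors(pos))
    if maximizing:
        best = float("-inf")
        for child in children:
            best = max(best, backed_up_eval(child, depth - 1, weights, advisors,
                                            successors, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:  # the opponent would never allow this line
                break
        return best
    best = float("inf")
    for child in children:
        best = min(best, backed_up_eval(child, depth - 1, weights, advisors,
                                        successors, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best


# Hypothetical toy game: positions are strings, two advisors per position.
TOY_TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
TOY_ADVISORS = {"a1": (3.0, 1.0), "a2": (1.0, 0.0),
                "b1": (0.0, 2.0), "b2": (5.0, 5.0)}


def toy_successors(pos):
    return TOY_TREE.get(pos, [])


def toy_advisors(pos):
    return TOY_ADVISORS.get(pos, (0.0, 0.0))
```

With uniform weights, `backed_up_eval("root", 2, [1.0, 1.0], toy_advisors, toy_successors)` backs up the leaf values 4, 1, 2 and 10 through one minimising and one maximising level. Changing the weights changes both the leaf values and, possibly, which line of play is selected.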

As in [13], then, one way to improve a set of weights is to adjust them so
that the resulting evaluation function more nearly approximates the backed up
value obtained from using the current weights at the leaves. As in [2], we do this
adjustment by starting with some initial weight vector $\lambda^{(0)}$ and then,
for each iteration $k = 1, 2, \ldots$, assembling equations

$$v_0(\lambda^{(k)}, P) = v_d(\lambda^{(k-1)}, P)$$

for a suitably large number of positions $P$ and finding the least squares solution
of this over-constrained system for $\lambda^{(k)}$. This process may converge on
a new weight vector $\lambda$.

We thus need a large number of positions P. We generate these positions 
by getting our program to play the current game (i.e., the particular SCL game 
which the program is trying to learn) many times, and introducing some ran- 
domness into its play in order to ensure that we get a reasonably wide variety of 
positions. Such self-play allows us to sample positions from a game in a way that 
is natural for each game, as well as showing (from the results of the games) how 
well a particular learning method is doing (see below). The resulting weights are 
average weights over all stages of the game, as in [2].

Of course, as noted in [2], the weight vector $\lambda$ found by the above process
might not be globally optimum (with respect to playing strength). In fact just
doing the above procedure may result in a weight vector that is worse than the
initial one (since some least squares solutions of the system

$$\forall P: \quad v_0(\lambda, P) = v_d(\lambda, P)$$

are bad, i.e. may cause poorer play if used as weights), so care needs to be taken
to prevent such degeneration. As well as picking a reasonable initial vector
(typically, having all components equal to 1; such “uniform” vectors were used by
Pell [11]), we find it helpful to periodically check on the game-playing
performance of our weight vectors and take steps to ensure that what we do does
actually improve things.
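
The regression step can be sketched as follows. Each collected position contributes one row of advisor values and one target, namely its backed-up value under the previous weights. The solver below uses the normal equations with Gauss-Jordan elimination only to stay dependency-free; this is an assumption made for illustration, not the routine used in the experiments, and a numerically sturdier method (QR or SVD) would be preferable when advisors are strongly correlated.

```python
from typing import List, Sequence


def least_squares(rows: Sequence[Sequence[float]],
                  targets: Sequence[float]) -> List[float]:
    """Least squares solution of rows * x = targets via (A^T A) x = A^T b."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T b].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("advisor columns are (nearly) linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical data: three positions, two advisors, and their backed-up values.
example_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_targets = [1.0, 2.0, 3.0]
example_weights = least_squares(example_rows, example_targets)
```

In this small example the targets are exactly reproducible by a linear combination of the advisors, so the fit is exact. With around 1000 positions and noisy backed-up values the system is over-constrained and the solution is only the best fit in the least squares sense.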

We now summarise our learning procedure. 

Fix:

d = depth of search (typically, 3);

N = number of equations per learning iteration
(typically, 1000; see below).

1. Input: rules of new SCL game G.

Aim: to learn a good weight vector $\lambda$ for G.

2. Initialisation:

$\lambda^{(0)} := \mathbf{1}$ (i.e., vector of n ones)

3. Games to be played between two computer players, each using α-β search
to depth d:

- one player uses fixed weights $\lambda^{(0)}$ (and does not learn);

- the other uses the current weights $\lambda^{(k)}$ (the “Learning Player”).

4. k := 1

5. k-th Learning Iteration:

Repeat the following until a predetermined number of games have been
played.

1. Play enough games between these two players to collect N linear
equations in $\lambda$, one for each position P and each equation having the form

$$v_0(\lambda, P) = v_d(\lambda^{(k-1)}, P).$$







Also keep a record of the Learning Player’s score over these games
(where, for each game, win = 1, draw = 1/2, loss = 0). Let this score,
as a fraction of games played during this Learning Iteration, be $p_{k-1}$.
(Thus, for each $k$, $p_k$ is an indication of how successful the weight vector
$\lambda^{(k)}$ was.)

2. Use linear regression to find the least squares solution to this system of
equations for $\lambda$. (We used LAPACK version 2.0.)

3. Update weight vector according to how much of an improvement it is
over the previous weight vector:

$$\lambda^{(k)} := (1 - p_{k-1})\,\lambda + p_{k-1}\,\lambda^{(k-1)}$$

(Remark: this was found to give better results than just putting
$\lambda^{(k)} := \lambda$. The latter gave erratic performance, frequently
throwing away good weight vectors in favour of inferior ones. Refer to our
remarks above on the dangers of converging on a bad weight vector.)

4. Next Learning Iteration: k := k + 1.

6. Combining weight vectors: a final weight vector $\lambda^*$ is chosen from all
the $\lambda^{(k)}$ found at each of the successive learning iterations. Three
methods were tried.

Only one of the following is done:

(a) $\lambda^*$ = the $\lambda^{(k)}$ with the highest $p_k$ (i.e. the best record).

(b)

(c) $\lambda^* = \sum_{k:\, p_k > 1/2} p_k \lambda^{(k)}$. (I.e., only “winning”
vectors, those whose average score per game is more than 1/2, are included in
the weighted sum.)
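
The update in step 5.3 and combining methods (a) and (c) can be sketched in a few lines of Python. The function names are hypothetical, weight vectors are plain lists, and only methods (a) and (c) are shown.

```python
from typing import List, Sequence


def blend(solution: Sequence[float], previous: Sequence[float],
          p_prev: float) -> List[float]:
    """lambda^(k) := (1 - p_{k-1}) * lambda + p_{k-1} * lambda^(k-1)."""
    return [(1.0 - p_prev) * s + p_prev * q for s, q in zip(solution, previous)]


def combine_best(vectors: Sequence[Sequence[float]],
                 scores: Sequence[float]) -> List[float]:
    """Method (a): the weight vector with the highest score p_k."""
    best = max(range(len(vectors)), key=lambda k: scores[k])
    return list(vectors[best])


def combine_winning(vectors: Sequence[Sequence[float]],
                    scores: Sequence[float]) -> List[float]:
    """Method (c): score-weighted sum of the vectors with p_k > 1/2."""
    total = [0.0] * len(vectors[0])
    for vec, p in zip(vectors, scores):
        if p > 0.5:
            total = [t + p * v for t, v in zip(total, vec)]
    return total
```

The blend is conservative by design: the better the previous weights scored, the less the new regression solution is allowed to move them, which is how the procedure avoids discarding a good vector on the strength of one noisy fit. Since weight vectors need not be normalised, the unnormalised sum in `combine_winning` plays the same way as its weighted average.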

A number of technical details should be mentioned. 

Mathematically, the weight vectors do not need to be normalised, although 
no harm is done in doing so (and it may become necessary in practice if the 
weights get too large). 

The value assigned to winning positions was simply 1.2 times the largest value 
previously occurring in the collected equations. This has the effect of giving 
a winning position a value higher than any other position encountered in the 
game, but not completely dominating the collected equations. It also works for 
different games where the highest position value during the game is unknown. 
This is similar in spirit to [2], where a win is one plus the total material value 
at the start of the game. Our choice of 1.2 is fairly arbitrary. 

In order to ensure that, for any given SCL game, the games played were 
reasonably varied, a small random amount was added to position evaluations for 
the first few (typically, 10) moves of each game. This was done so as to affect 
choice of move without affecting the data used for the equations. 

Draws can arise in any of three ways. Firstly, games are declared drawn after
200 moves. Secondly, draw by repetition: this was found to be important for the
quality of equations generated. We curtail the game after two back-and-forth
moves by both players so that repeated positions do not swamp and distort the
data. Finally, a draw can occur when both players meet goals at the same time.
An example of this could be when the white player meets his goal by taking the
opponent’s King (where there is an eradicate goal on that King), but at the same
time lands on a square which meets an arrive goal of the opponent.

In Table 1 we give a brief description of the actual advisors used, which 
were the same as those of Pell [10,11,12]. All advisors are symmetric in the 
sense mentioned in the previous Section. Furthermore, where contributions (such 
as piece value, mobility, number of captures) from all pieces on the board are 
summed, the contributions are positive for the player’s pieces and negative for 
the opponent’s. Pell’s work gives further details. 



Advisor            Description
-----------------  ------------------------------------------------------------
Material           Total value of pieces on board (individual piece values
                   precomputed, similarly to [11])
Static mobility    Total empty-board single-move moving mobility of pieces in
                   current position
Eventual mobility  Total empty-board multiple-move moving mobility of pieces
                   currently on board (where the contribution due to a series
                   of moves decays exponentially with number of moves)
Dynamic mobility   Total moving mobility with all pieces in current position
Capture mobility   Total number of captures available in current position
Possess            Sum of values (averaged over empty-board squares) of all
                   possessed pieces (i.e. pieces in hand, as in Shogi)
Global threat      Value of player’s best threat minus value of opponent’s
                   best threat
Arrive             Sum, over pieces with an arrive goal, of measure of
                   difficulty of achieving the goal

Table 1. Advisors used (based on Pell [10, 11]).



3 Results 

A number of experiments were run to test the performance of our learning meth- 
ods. 

The reader is reminded that our Metagame player, written in C by Pow- 
ell, is not quite as general as Pell’s in that it does not quite cover the whole 
SCL Metagame class. Some games in Pell’s class are omitted from ours, sim- 
ply because the extra generality required would have taken a disproportionate 
amount of time to implement for the benefit gained. Our class is still very gen- 
eral, and includes, for example, the following features: rectangular boards of any 
size; goals based on stalemating, arrival or eradication, either of the player’s or 
opponent’s pieces, and arbitrary disjunctions of such goals; leaping, riding and 
hopping pieces; pieces which capture differently to the way they move; different 
capture effects (removal from game, or possession by the player or the opponent);
promotion (to either player’s or opponent’s piece); a possible must-capture con- 
straint applying either to specific pieces or the whole game; similarly, a possible 
continue-captures constraint. The principal respects in which our class falls short 
of Pell’s are: initial placement of pieces must be predetermined by the game
definition, so cannot be done randomly or by the players; no cylindrical boards; the
decision on which piece to promote to must be made by the player (never the op- 
ponent); no retrieval captures; each piece can only have one capture effect (even 
if it can capture in several different ways). We argue that these shortcomings 
are minor. Our class is broad enough to include, for example, approximations to 
the following chess variations: International Chess, Chaturanga (the first known 
form of Chess), Shatranj (an old Arabic variant), various forms of medieval chess 
(similar to Shatranj), Courier Chess, Turkish Great Chess, Shogi, Tsui Shogi (or 
Middle Shogi, not in Pell’s SCL Metagame class (but we expect that could be 
easily changed)), Thai Chess, Capablanca’s Chess and Losing Chess. It contains 
many other games as well, including Draughts. 

3.1 Playing Performance 

In the following tables we show the performance of our learning methods on 
Chess, Losing Chess, Draughts and five games which were randomly generated 
by Pell’s game generator. These games may be reproduced from that generator 
using the seeds given in the following table: 



Game   Seed
-----  ---------------------
Game1  rand(12123,122,231)
Game2  rand(1938,13844,9541)
Game3  rand(19333,5115,4838)
Game4  rand(234,13,635)
Game5  rand(234,19,10)



In each case, the depth d = 3, the number of linear equations used per 
iteration was 1000, and the Black player was arbitrarily chosen to be the one to 
do the learning. The ‘Baseline’ column gives results when neither player did any 
learning. This column reveals that some games are inherently biased in favour 
of one or other of the players, which must be taken into account when assessing 
the results of learning. 

Learning was performed by playing each game 1000 times. The number of po- 
sitions (and hence equations) produced depends on the lengths of the games, and 
1000 such positions were required for each learning iteration. Typically, playing 
a game 1000 times would produce around 20 learning iterations. After learning 
was completed, a single weight vector was produced using one of the three meth- 
ods of combining weight vectors mentioned at the end of our description of the 
learning method. Then this learnt weight vector was used to play the game 1000 
(or in some cases 500) times more to produce the final three columns of Table 2. 
This data indicates how well the learnt weight vector performs against the fixed, 
uniform-weights player. 

            Result         Baseline       Learning (by Black)
Game        (for Learner)  (no learning)  with combining technique ...
                                              1        2        3
----------  -------------  -------------  -------  -------  -------
Game1       won                31.9         34.4     27.8     30.6
            drawn              46.2         41.4     53.8     46.4
            lost               21.9         24.2     18.4     23.0
            score:             55.0         55.1     54.7     53.8

Game2       won                 6.3          5.2      5.0      2.4
            drawn              88.2         93.4     94.2     96.2
            lost                5.5          1.4      0.8      1.4
            score:             50.4         51.9     52.1     50.5

Game3       won                24.3         25.8*    25.4*    23.6*
            drawn              56.3         56.2*    56.6*    58.4*
            lost               19.4         18.0*    18.0*    18.0*
            score:             52.5         53.9     53.7     52.8

Game4       won                32.2         38.2     44.8     47.0
            drawn              29.0         26.6     26.4     27.6
            lost               38.8         35.2     28.8     25.4
            score:             46.7         51.5     58.0     60.8

Game5       won                29.4         41.4     39.8     40.8
            drawn              34.8         33.4     32.6     36.8
            lost               35.8         25.2     27.6     22.4
            score:             46.8         58.1     56.1     59.2

Chess       won                31.2         35.0     26.4     38.3
            drawn              38.3         40.6     41.8     39.2
            lost               30.5         24.4     31.8     22.5
            score:             50.4         55.3     47.3     57.9

Lose-chess  won                51.3         95.6     78.1     83.5
            drawn               4.8          2.2      7.6      5.5
            lost               43.9          2.2     14.3     11.0
            score:             53.7         96.7     81.9     86.3

Draughts    won                39.0          -        -       56.9
            drawn              13.9          -        -       24.0
            lost               47.1          -        -       19.1
            score:             46.0          -        -       68.9

Table 2. Results from learning with 1000 equations per iteration, depth d = 3,
with Black learning. Won/drawn/lost figures are percentages, from 1000 games
(or 500, if starred). Scores: for each game, win/draw/loss = 1.0/0.5/0.0 points;
total then scaled to 100.
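
The score rows follow directly from the won and drawn percentages. The sketch below reproduces that arithmetic and adds a standard error for the score. The standard error is an assumption introduced here for illustration (independent games, each scoring 1, 1/2 or 0) and is not necessarily the significance test applied in this paper.

```python
import math


def score(won: float, drawn: float) -> float:
    """Score out of 100 from won/drawn percentages (win = 1, draw = 1/2)."""
    return won + drawn / 2.0


def score_standard_error(won: float, drawn: float, games: int) -> float:
    """Standard error of the score, in score points, assuming independent games."""
    w, d = won / 100.0, drawn / 100.0
    mean = w + d / 2.0
    # Variance of a per-game result X in {1, 1/2, 0}: E[X^2] - E[X]^2.
    variance = w + d / 4.0 - mean * mean
    return 100.0 * math.sqrt(variance / games)


# Game4 figures taken from Table 2: baseline, and combining technique 3.
game4_baseline = score(32.2, 29.0)
game4_learned = score(47.0, 27.6)
game4_baseline_se = score_standard_error(32.2, 29.0, 1000)
```

Under these assumptions each Game4 score carries a standard error of roughly 1.3 points, so a gap of about 14 points between the baseline and the third technique is far larger than sampling noise, whereas the gaps of a point or two seen in Games 1 to 3 are not.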

These results show that learning produced a significantly better player for
Game 4, Game 5, Lose-chess, and Draughts. For Chess, significant improvement
was obtained with the third weight-combining technique, although the average
performance over all three techniques is not significantly better. In Games 1, 2
and 3, the Learning Player’s performance did not differ significantly from the
baseline. Of the three weight-combining techniques considered, the third seemed
to be the best, in that it always produced improved play (over baseline) whenever
either of the other two did. Similar results, for the randomly-generated games,
were obtained when we set the Learning Player to be White (instead of our usual
choice of Black).

A further 12 random games were generated by Pell’s system. Two were de- 
generate, in that one player always won. Of the other ten, there was very clear, 
significant improvement (over baseline play) in five, and no apparent improve- 
ment at all (but no worsening!) in the other five. Both these results, and the 
ones above, support the conclusion that our learning method produces signifi- 
cant improvement in play for about half the games generated randomly in Pell’s 
SCL class, and appears to make no difference for the rest.

It is important for our learning methods that enough learning iterations 
are allowed. Results obtained using only 1/5 as many such iterations showed 
no evidence of learning, even though we were using ten times as many linear 
equations per learning iteration. 

We used search depth 3 because it was the highest we could afford, given 
the computational resources available and the number of experiments. Some 
experiments were conducted for search depths other than 3. Depth 1 was clearly 
too small. Learning with depth 1 gave improved play for Chess and Draughts, 
but made play much worse for Lose-chess and Game 3. The picture for depth 
2 was similar. Resource constraints did not permit much experimentation with 
depth 4, but learning did seem to improve play in Draughts. 

One problem with our learning method is that it occasionally diverges for 
some games. Advisor weights become poor (perhaps through random noise), 
then get progressively worse because applying the evaluation function (with the 
bad weights) to the leaves of the search tree sometimes causes the backed-up 
value to become even worse. When this occurs, it is generally observed that one 
advisor weight becomes more and more negative until reaching -1. It is nearly
impossible for the learning method to recover from this situation. While the 
technique for combining weight vectors at the end of learning ignores the poor 
weight vectors, increased learning time has no effect. 

One possible remedy for this problem is to use the eventual outcome of a 
game to give an indication as to how reliable the backed up value is (suggested 
by C. S. Wallace (personal communication)). This could be done in a number 
of ways. It would appear that any method which assigns credit or blame for the 
outcome to earlier positions would need to take account of how far a position is 
from the end of the game. For example, in Tesauro’s application [16] of Sutton’s 
TD algorithm [15] to Backgammon, the credit due to a position for the result 
of a game decreases exponentially with distance from the end of the game. It is 
not clear how to do such credit assignment for general games in the SCL class,
since a critical error could occur at any point in a game. 

3.2 Advisor Weights 

The advisor weights learnt for these games are of interest in themselves. It should
be borne in mind that, since the advisors return values on different “scales”,
their weights in a given game cannot be compared solely on the basis of their
magnitude.

Table 3 gives examples of weights learned by our method. The tendency of 
one advisor to dominate is noted. 



Game        Material  Static  Eventual  Dynamic  Capture  Possess  Gthreat  Arrive
----------  --------  ------  --------  -------  -------  -------  -------  ------
Game1           0.11    0.10      0.08     0.10    -0.09     0.08     0.36    0.08
Game2           0.06    0.09      0.02     0.05     0.04     0.67     0.04    0.03
Game3           0.09    0.09      0.05     0.10    -0.26     0.22     0.11    0.07
Game4           0.02    0.02      0.04     0.04     0.00     0.52     0.02    0.33
Game5           0.06   -0.12     -0.09     0.06     0.12     0.47     0.07    0.00
Chess           0.11    0.18     -0.02     0.13     0.00     0.04     0.47    0.04
Lose-chess     -0.01    0.19      0.03     0.01    -0.67     0.01     0.07    0.01

Table 3. Examples of weights learned



It is evident that some advisor weights in the Table are negative. This stands 
in contrast to Pell’s suggestion (see [11, 15.5.1.2] and [12, §2.3]) that weights 
should never be negative since advisors capture “. . . properties of a position 
which should be valuable to a player, other things being equal”. Pell recog- 
nised [12] that some advisors may be negatively correlated with success for a 
player, which appears to be the case for the Capture advisor in Lose-chess. In 
that particular case, it is easy to see why that advisor should receive a nega- 
tive weight, since threatening to capture would often help the opponent more 
than the player. Pell suggested that, when an advisor correlates negatively with 
success, another advisor should be sought which recognises why the opponent 
derives an advantage when the original advisor becomes more negative. 

Not all negative weights correspond to advisors which are negatively corre- 
lated with success, however. It can be shown that if several advisors which are 
constructive (i.e. positively correlated with success) are also highly correlated 
with each other, then it is quite possible that some (but not all) should be neg- 
atively weighted. It is the weighted combination of such advisors that should 
always be positive, not the individual weight × advisor-value products, and this
was indeed found to be the case. High correlation suggests that some of the 
advisors may be redundant or nearly so. It follows also that, if we are setting 
advisor weights ourselves, we cannot use a single advisor’s weight, alone, to say 
how important that advisor is individually, independent of other advisors. 
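
A small numerical illustration of this effect, using made-up data rather than anything measured in our experiments: two hypothetical advisors are each positively correlated with the target value and strongly correlated with each other, yet the least squares fit gives one of them a negative weight.

```python
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_two(a1: Sequence[float], a2: Sequence[float],
            y: Sequence[float]) -> Tuple[float, float]:
    """Least squares weights for y ~ w1*a1 + w2*a2 (normal equations, Cramer's rule)."""
    s11 = sum(u * u for u in a1)
    s12 = sum(u * v for u, v in zip(a1, a2))
    s22 = sum(v * v for v in a2)
    b1 = sum(u * t for u, t in zip(a1, y))
    b2 = sum(v * t for v, t in zip(a2, y))
    det = s11 * s22 - s12 * s12
    return (b1 * s22 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det


# Hypothetical advisor values over four positions, and the value to be fitted.
adv1: List[float] = [1.0, 2.0, 3.0, 4.0]
adv2: List[float] = [1.0, 2.0, 3.0, 5.0]
value: List[float] = [1.0, 2.0, 3.0, 3.0]

w1, w2 = fit_two(adv1, adv2, value)
```

Here both advisors are constructive in the sense above, since each correlates positively with `value`, but the fitted weight of the second is negative. The weighted combination `w1*adv1 + w2*adv2` nevertheless reproduces the positive target values, which is the sense in which the combination, rather than each product, should be positive.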

4 Conclusion



Our method of unsupervised learning in Metagame, which is a refinement of 
that used in [2] for Chess and Noughts-and-Crosses, has been shown to improve 
play significantly for many Symmetric Chess-like games in our minor restriction 
of Pell’s class. For about as many games, however, it produces no discernible 
improvement with depth 3, though with that depth we find that it seldom makes 
things worse. Depth 3 appears to be the smallest depth at which our method 
produces improved play on average. It is of course likely that greater depth 
would produce more significant improvement. It would be interesting (although 
computationally expensive) to try this out, and see whether all SCL games can be 
learned with enough depth, or whether some games in the class are unlearnable 
by our methods, perhaps because our advisor set is not comprehensive enough. 
Another worthwhile extension to our approach would be to use a search method 
with nonuniform depth, such as quiescent search. This might give improved play 
without such a high computational cost. 

The advisor weights learned for the different games are of interest in them- 
selves. We have shed light on the appearance of negative advisor weights, which 
turns out not to be a problem after all, even when the advisor in question is 
constructive. In such cases negative weights are a manifestation of correlation of 
advisors, and weighted sums of correlated advisors will still be positive. 

Some advisors were found to be mostly ignored after learning for most
games, but occasionally were important. This leads to the idea of dynamic
inclusion/exclusion of advisors during learning. If an advisor were found to be
essentially useless for a particular game, then it could be removed to allow more
time for learning the other advisors (if time limits were in effect).

As suggested by Pell, it would be desirable to allow somehow for the
automatic construction and incorporation of new advisors and therefore for more of
the actual game analysis to be performed by the machine. In [12], Pell discusses
accumulating advice from elementary advisors (such as, for example, positional
indicator functions which simply indicate, for a given piece and square, whether
that piece is on that square), and calculating some advisor weights directly
from the game description, using subfeatures (rather than learning the weights
through play). (METAGAMER’s automatic computation of piece values may be
thought of as an example of this latter calculation.) These matters invite further
exploration.

We have concentrated on learning advisor weights. It would be interesting to 
learn values for other parameters which influence play. For example, there are
parameters in Metagame which give the rate at which certain types of reward 
(e.g. for possible promotion, or control of a square) decline with distance [11]. 
For another example, consider the search strategy. Cron [3] introduced search 
advisors to control search in Metagame, and studied the effect of such advisors’ 
weights on play. 


Abstract. Data mining is about extracting hidden information from a large
data set. One task of data mining is to describe the characteristics of the data
set using attributes in the form of rules. This paper aims to develop a neural
networks based framework for the fast mining of characteristic rules. The idea
is first to use the Kohonen map to cluster the data set into groups with common
features, and then to use a set of single-layer supervised neural networks to
model each of the groups so that the significant attributes characterizing the
data set can be extracted. An incremental algorithm combining these two steps
is proposed to derive the characteristic rules for data sets with nonlinear
relations. The framework is tested on a large problem involving forensic data
of heart patients, and its effectiveness is demonstrated.



1. Introduction 

Data mining is a new and exciting research field that is receiving increasing
attention. As a multidisciplinary research field, it requires knowledge of many
information technology areas such as databases, artificial intelligence, networking,
information retrieval, computational intelligence and statistics. For data mining, the
first task is to construct a model that represents a huge data set of interest from a
database. This enables data mining tasks to be done from the top end by users without
knowing details of the data. One common representation of data is by means of rules.
There are mainly three kinds of rules: association rules, classification rules and
characteristic rules, the last of which is the one of interest to data miners [1,2].
Classification rules classify entities into groups by identifying common characteristics
among the entities [8]. These common characteristics of a particular group are
described by characteristic rules. A characteristic rule is an assertion that
characterizes the concept satisfied by almost all the examples in the data set of
concern.

Extracting characteristic and classification rules has been a very active research 
topic. AI techniques are often used for data mining modeling tasks, for example, the 

N. Foo (Ed.); AI'99, LNAI 1747, pp. 36-47, 1999. 
well known decision tree approach ID3 [15]. Decision trees are easy to construct and
so is the rule extraction. However, their classification performance may be
compromised: when the data set has complicated (nonlinear) domains and the
mining goal is to assign each example in the data set to one of many categories, the
decision tree approach has to generate many branches for each node, which is
computationally expensive and may result in a large mining error. Another emerging
AI tool is neural networks, which have been proven to be better than the decision
tree approach when dealing with complicated (nonlinear) data sets. Supervised
multilayer neural networks based on backpropagation have been an important
tool for data mining. Although they are robust with respect to noise, they suffer from
slow learning and convergence [10]. Nevertheless, they are much more efficient than
ID3, especially in dealing with numeric data sets. Improving learning speed is
crucial to the effective use of neural networks for data mining.

The aim of this paper is to develop a neural networks based framework for mining
characteristic rules with significantly improved learning and convergence speed. The
framework contains a Kohonen map, which is used to decompose a large data
set into a number of small-sized groups for fast mining. A single-layer neural
network (SSNN) is used to model each data group identified via the Kohonen
map. It is well known that SSNNs, although simple to use, are poor at
handling nonlinear and large data sets. We therefore propose to use a set of SSNNs to
model the data groups, each of which can be modeled as a linear model using an
SSNN. We then develop an incremental approach for mining characteristic rules
(IAMCR), which combines the two steps mentioned above. An experimental study is
provided to show the effectiveness of the algorithms proposed.

This paper is organized as follows. Section 2 introduces the data mining problem 
statement, the Kohonen map and the SSNN, which we will use for data mining 
throughout the paper. Section 3 presents the incremental approach for mining 
characteristic rules. Experiments are illustrated to show the effectiveness of these 
algorithms in Section 4. Finally, Section 5 gives some concluding remarks. 



2. The Problem Statement and the Kohonen Map and the SSNNs 

The problem of mining characteristic rules from a set of examples of the same class,
denoted as $G$, can be formally stated as follows. Consider $n$ attributes
$(a_1, a_2, \ldots, a_n)$. Let $D_i$ represent the set of possible values for
attribute $a_i$. We are given a large data set or database $D$ in which each
example is an $n$-tuple of the form $(v_1, v_2, \ldots, v_n)$, where $v_i \in D_i$
and $(v_1, v_2, \ldots, v_n) \in G$, i.e. each example belongs to the same
class $G$. The problem is to obtain a set of characterizing rules in a conjunctive
normal form

$$a_i \wedge \cdots \wedge a_j \Rightarrow G, \quad \text{where } i \neq j \text{ and } i, j \le n.$$
The Kohonen map [17] will be used to group data initially. The Kohonen map 
describes a mapping from the input data space $R^n$ onto a two dimensional array of 
neurons or nodes. A reference vector $m_i = (m_{i1}, m_{i2}, \ldots, m_{in})^t \in R^n$ 
is associated with each neuron i, where the $m_{ij}$ are weights. The input vector, 
denoted as $x = (x_1, x_2, \ldots, x_n)^t$, where the superscript t denotes the 
transpose operation, is connected to all neurons. The training here is different 
from other neural networks in that it consists of choosing a winner neuron by means 
of a similarity measure and updating the weights of the neurons in the neighborhood 
of the winning neuron. The Euclidean distance is used as the similarity measure 
between the input vector and the neuron weight vector. The weight update rule for 
neuron i is described as follows: 

$$m_i^{k+1} = m_i^k + h_{ci}^k \, (x^k - m_i^k)$$

where k denotes time and $h_{ci}^k$ is a non-increasing neighborhood function around 
the winner neuron c. The neighborhood function used is as follows: 

$$h_{ci}^k = \beta^k \exp\left(-\frac{\|r_i - r_c\|^2}{2 (\sigma^k)^2}\right)$$

In this neighborhood function $r_i$ and $r_c$ are the location vectors of the neuron 
undergoing change and of the winning neuron. Two parameters $\beta$ and $\sigma$ are 
used here. The first one refers to the learning rate and the latter to the 
neighborhood, and both are decreasing functions of time, that is 

$$\beta^k = \frac{\beta_0}{\sqrt{1+k}}, \qquad \sigma^k = \frac{\sigma_0}{\sqrt{1+k}}$$

where $\beta_0$ is the initial learning rate and $\sigma_0$ the initial neighborhood 
size, chosen as the maximum radius of the Kohonen lattice. $\beta_0$ is usually 
between 0 and 1. 
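
The update described above can be illustrated with a minimal sketch of a single Kohonen training step. The Gaussian neighborhood, the $1/\sqrt{1+k}$ decay, the function name `som_step` and the default values of `beta0` and `sigma0` are assumptions made for illustration; they are not taken from [17].

```python
import numpy as np

def som_step(weights, grid, x, k, beta0=0.5, sigma0=1.0):
    """One Kohonen update: find the winner, then pull its neighbours toward x."""
    # Winner c: the neuron whose reference vector is closest to x (Euclidean distance).
    c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Assumed decay schedules for the learning rate and the neighbourhood width.
    beta = beta0 / np.sqrt(1.0 + k)
    sigma = sigma0 / np.sqrt(1.0 + k)
    # Assumed Gaussian neighbourhood around the winner on the 2-D lattice.
    d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = beta * np.exp(-d2 / (2.0 * sigma ** 2))
    # m_i <- m_i + h_ci * (x - m_i) for every neuron i.
    return weights + h[:, None] * (x - weights), c

# A 2x2 lattice with 3-dimensional inputs.
grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
weights = np.zeros((4, 3))
x = np.array([1.0, 0.0, 1.0])
new_weights, winner = som_step(weights, grid, x, k=0)
```

Every neuron moves toward the input, and the winner moves the most because the neighborhood function peaks at the winner's own lattice position.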

The SSNN we will use is the ADALINE (Adaptive Linear Neuron) which was 
developed by Widrow and Hoff [4]. The ADALINE, the building block for feed 
forward neural networks, is defined as 

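$$y = w^t x$$

where y is the output, with 

$$x = (x_1, x_2, \ldots, x_n)^t, \qquad w = (w_1, w_2, \ldots, w_n)^t$$

representing inputs and weights respectively. The learning algorithm, which is often 
referred to as the Widrow-Hoff delta rule [4], is 

$$w^{k+1} = w^k + \alpha e^k x^k, \qquad e^k = d_k - (w^k)^t x^k$$

or in a modified form [7] 

$$w^{k+1} = \begin{cases} w^k + \alpha \dfrac{e^k x^k}{(x^k)^t x^k} & \text{if } (x^k)^t x^k \neq 0 \\ w^k & \text{if } (x^k)^t x^k = 0 \end{cases} \qquad (1)$$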



The learning rule (1) will be the main algorithm we use for training SSNN for data 
mining. 
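
As a minimal sketch, one step of the normalized delta rule can be written as follows. The function name `adaline_update` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def adaline_update(w, x, d, alpha):
    """One step of the normalised Widrow-Hoff rule; returns (new_w, error)."""
    e = d - w @ x                  # error before the update
    xx = x @ x
    if xx == 0:                    # zero input: leave the weights unchanged
        return w.copy(), e
    return w + alpha * e * x / xx, e

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
w1, e = adaline_update(w, x, d=1.0, alpha=1.0)
```

With this normalization, the error on the presented input shrinks by the factor $(1 - \alpha)$ after each step, so `alpha = 1.0` fits a single example exactly in one update.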
3. Using SSNNs for Mining Characteristic Rules 



Consider a class, G, with n attributes $a = (a_1, a_2, \ldots, a_n)$ where 
$a \in R^n$. For given examples of G, we want to find the characteristic rules that 
characterize G. Hence the n attributes of each example become the inputs to the 
SSNN, and the output of the SSNN is an indicator of the degree of belonging to G. 
Assume each attribute takes a value of either 1 or 0. 

The characteristic rules can be found by selecting those attributes which have 
larger weights since the larger values indicate that these attributes have a more 
significant contribution towards the output than others. Since the SSNN does not use 
saturating functions, the weights truly reflect the characteristic nature of the 
particular attributes. 

For our mining purpose, we set $d = 1$ as the desired output value. Our target is 
to train the SSNN so that it can approximate the given data set. We want to obtain 
the characteristic rules in the form $a_i \wedge \cdots \wedge a_j$ $(i \neq j \le n)$. 

We now develop the following algorithm for mining characteristic rules based on the 
SSNNs, assuming the data set we are dealing with is linear so that it can be modeled 
by the SSNN. Given are P training pairs, $\{a^1, a^2, \ldots, a^P\}$, $a^i \in R^n$, 
and the desired output $d = 1$, 

1. Initialize iteration = 1, i = 0, sum_error = 0, $\Delta w = 0$ and the random 
weights $w \in R^n$. Set $\alpha, \varepsilon > 0$ and max_iteration; 

2. The training cycle begins here. Input is presented and output computed: 
$y = w^t a^i$ 

3. Weights are updated as: 
If $(a^i)^t a^i > 0$ then $\Delta w = \alpha \, (d - y) \, \dfrac{a^i}{(a^i)^t a^i}$ 
Otherwise $\Delta w = 0$ 
$w = w + \Delta w$ 
sum_error = sum_error + $(d - y)^2$ 

4. If sum_error > $\varepsilon$ and iteration <= max_iteration, go back to step 2; 

5. Select the weights which have a significant contribution to output d compared to 
others; 

6. Construct the input-output relationship with the selected weights; 

7. Extract rules from the input-output relationships by varying the values of the 
input attributes. The combinations which give an output close to 1 will be selected 
as characteristic rules; 

8. Optimize the rules using the constraints defined by the user. 
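
Steps 1 to 5 can be sketched as below. This is one possible reading of the algorithm, not a definitive implementation: resetting sum_error at the start of every pass, the small initial weight range, and the `ratio` criterion used to decide which weights are "significant" are all assumptions, and the names `train_ssnn` and `significant` are hypothetical.

```python
import numpy as np

def train_ssnn(A, alpha=0.5, eps=1e-4, max_iteration=50000, seed=0):
    """Train one linear neuron so that w @ a is close to d = 1 for every row a of A."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.01, 0.01, A.shape[1])   # step 1: small random weights
    d = 1.0
    sum_error = np.inf
    for iteration in range(1, max_iteration + 1):
        sum_error = 0.0                        # assumption: error is reset each pass
        for a in A:
            y = w @ a                          # step 2: compute the output
            aa = a @ a
            if aa > 0:                         # step 3: normalised delta rule
                w = w + alpha * (d - y) * a / aa
            sum_error += (d - y) ** 2
        if sum_error <= eps:                   # step 4: stop once the error is small
            break
    return w, sum_error

def significant(w, ratio=0.5):
    """Step 5 (assumed criterion): indices whose weight is at least ratio * max weight."""
    return [j for j, wj in enumerate(w) if wj >= ratio * w.max()]

# Three identical binary examples in which attributes 1, 2 and 4 are set.
A = np.array([[1, 1, 0, 1],
              [1, 1, 0, 1],
              [1, 1, 0, 1]], dtype=float)
w, err = train_ssnn(A)
```

Here the attribute that is never set keeps its small initial weight, while the three active attributes share the output equally, so `significant(w)` picks out exactly the active attributes.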



The algorithm can generate rules, but it is not guaranteed that all the rules 
generated are strong and satisfy the data mining requirements. Two concepts, 
Confidence and Support, are often used in the data mining literature to describe 
the strength of rules [5]. Since we have only a single class for the data set, the 
measure of confidence is not meaningful. Hence, we only use the concept of support 
to identify strong characteristic rules. Support is defined as follows: the rule 
$x \Rightarrow y$ has a support of s in data set Q if s percent of the examples in Q 
contain $x \cap y$, i.e. $s = Probability(x \cap y)$. The level of support depicts 
the frequency of the rule in the data set. Using this concept, the rules obtained in 
step 7 can be pruned using a prescribed support measure to determine the dominant 
attributes of the class. 
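
For binary-coded attributes, the support of a conjunctive rule reduces to a simple count. The following is a small sketch in which support is returned as a fraction in [0, 1]; the function name `support` and the toy data are hypothetical.

```python
import numpy as np

def support(data, rule):
    """Fraction of rows of the 0/1 matrix `data` in which every attribute in `rule` is 1."""
    return float(np.mean(np.all(data[:, rule] == 1, axis=1)))

data = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1],
                 [1, 0, 1]])
s = support(data, [0, 1])   # rule "attribute 1 AND attribute 2"
```

A rule would then be kept only if its support reaches the prescribed minimum.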



4. Incremental Approach for Mining Characteristic Rules 

Since the SSNN alone is incapable of handling large nonlinear data, an incremental 
approach is proposed in this section. This approach makes use of the piecewise 
local linearization idea in the construction of a neural network model to handle 
large nonlinear data sets. 

The SSNN model described in Section 3 is a single neuron structure, which can be 
considered as a local linear model for a nonlinear function. This model, in some 
sense, can only model a piece of the nonlinear function, which is, of course, 
insufficient to describe the entire nonlinear function. The piecewise local 
linearization idea has been used extensively in function approximation. Inspired by 
this idea, here we propose to use a set of SSNNs as local linear models to model 
“pieces” of the large nonlinear data set so that, as a whole, the entire large nonlinear 
data sets can be modeled and hence the characteristic rules can be extracted. It should 
be noted that such a structure for modeling a large nonlinear data set should give rise 
to a fast mining tool, evidenced by its simple structure, parallel implementation of 
SSNNs, and fast dynamic adjustments. 

We now propose the incremental approach for mining characteristic rules 
(IAMCR). The algorithm contains four phases: Pattern Clustering, Constructing a 
Set of SSNNs, Extracting Rules, and Forming Rules. In the Pattern Clustering phase, 
the given data are subjected to the Kohonen map so that they can be clustered into 
groups with similar common features. This reduces the number of SSNNs needed for 
approximation. In the construction phase, clustered data sets are input to a single 
SSNN at a time for training so that a given tolerance level of training is 
satisfied. The remaining data examples are fed to another SSNN for training, and so 
on, until all the data are approximated. The rule extraction phase is done using the 
algorithm developed in Section 3. Since there is no guarantee of obtaining rules 
which satisfy the user defined support level, the Forming Rules phase is completed 
by investigating combinations of attributes so that representative characteristic 
rules can be formed. Details of the phases are given below. 



4.1 Pattern Clustering 

To reduce the number of SSNNs for mining, the examples should first be 
preprocessed and clustered. We propose to use the two dimensional Kohonen map 
with an assumed maximum number of clusters, l. (For the two-dimensional case, 
l is defined as l = (number of rows) x (number of columns).) The training 
algorithm used is the one in [17], and the quantization error is used to measure the 
goodness of training. The neurons in the Kohonen map represent the centres of the 
clusters formed by the examples with some common features. All the examples are 
submitted again to the Kohonen map to determine the grouping of examples, which can 
be ordered in descending order of similarity between examples. 



4.2 Constructing a Set of SSNNs 

The construction of SSNNs for mining can be done as follows. With the P training 
pairs, $A = \{a^1, a^2, \ldots, a^P\}$, $a^i \in R^n$, and the desired output $d = 1$: 

a) Initialize the set of SSNNs, denoted as SSSNN, to empty 
and Training Pattern Set = A; set also the SSNN parameters $\alpha, \varepsilon > 0$ 
and max_iteration; 

b) Training of SSNNs starts here. Create a new SSNN with random weights $w \in R^n$ 
and put the SSNN into SSSNN; 

c) Initialize NonTrained Pattern Set = empty and Trained Pattern Set = empty; 

d) Take an example $a^i \in R^n$ from Training Pattern Set and set sum_error = 0; 

e) Train the current SSNN with $a^i$ using the learning algorithm in Section 3 until 
sum_error < $\varepsilon$ or iteration > max_iteration; 

f) If sum_error < $\varepsilon$ then put $a^i$ into the Trained Pattern Set, otherwise 
put $a^i$ into the NonTrained Pattern Set; 

g) If Training Pattern Set is not empty, go back to step d. Otherwise the training of 
the current SSNN terminates. Trained Pattern Set is the set trained by the current 
SSNN; 

h) If NonTrained Pattern Set is not empty, set Training Pattern Set to 
NonTrained Pattern Set and go back to step b for the next SSNN. Otherwise the training 
session of the algorithm ends. 
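
The construction loop can be sketched as follows. One point of interpretation is assumed here and is not stated in the text: a pattern counts as "trained" only if the current SSNN can still fit it together with the patterns it has already accepted, since a lone nonzero pattern can always be fitted on its own. The names `fit` and `build_ssnns`, the parameter values and the guard against untrainable patterns are likewise hypothetical.

```python
import numpy as np

def fit(patterns, w, alpha, eps, max_iteration):
    """Normalised delta-rule passes over `patterns`; returns (w, sum_error of last pass)."""
    sum_error = np.inf
    for _ in range(max_iteration):
        sum_error = 0.0
        for a in patterns:
            y = w @ a
            aa = a @ a
            if aa > 0:
                w = w + alpha * (1.0 - y) * a / aa
            sum_error += (1.0 - y) ** 2
        if sum_error < eps:
            break
    return w, sum_error

def build_ssnns(A, alpha=0.5, eps=1e-4, max_iteration=2000, seed=0):
    """Return a list of (weights, trained_patterns), one entry per SSNN."""
    rng = np.random.default_rng(seed)
    ssnns = []                                   # step a
    training = list(A)
    while training:
        w = rng.uniform(-0.01, 0.01, A.shape[1]) # step b: new SSNN
        trained, non_trained = [], []            # step c
        for a in training:                       # steps d to g
            w_try, err = fit(trained + [a], w.copy(), alpha, eps, max_iteration)
            if err < eps:                        # step f: accept the pattern
                w, trained = w_try, trained + [a]
            else:
                non_trained.append(a)
        if not trained:                          # guard: nothing could be fitted
            break
        ssnns.append((w, np.array(trained)))
        training = non_trained                   # step h: next SSNN
    return ssnns

# The third pattern is linearly inconsistent with the first two for d = 1.
A = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
ssnns = build_ssnns(A)
```

On this toy input the first SSNN accepts the first two patterns and the third is deferred to a second SSNN, which mirrors the piecewise linear modeling idea described above.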



4.3 Extracting Characteristic Rules 

Before going to the rule extraction phase from trained SSNNs, we need to define the 
mining requirement which has to be satisfied by all mined rules. We define Local 
support and Global support as a means to measure the support level of an individual 
SSNN as well as the overall performance over the entire data set. We use the 
following heuristic for extracting rules: we only consider the rules which have 
local and global support levels higher than the minimum desired support level. 
This significantly reduces the number of rules to be dealt with and speeds up 
extraction. The algorithm for extracting characteristic rules is shown below: 

a) Initialize Char_Rule_Set = empty and Desired_Support; 

b) Do the following steps until there is no SSNN in SSSNN; 

c) Extract an SSNN from SSSNN and find the characteristic rule by inspecting 
significant weights; 

d) Calculate $k = \sum_j w^i_j$ where $w^i_j$ is the j-th significant weight of the 
i-th SSNN; 

e) If $|k - 1| \le \varepsilon$ then mine a characteristic rule of the form 
$\bigwedge_j x_{i_j}$ where $x_{i_j}$ denotes the $i_j$-th attribute; 

f) Calculate the local support and global support of the rule found in step e. If both 
supports are equal to or larger than the Desired_Support, put the rule into 
Char_Rule_Set, otherwise go to step e until all groups are considered. 

g) The Char_Rule_Set contains the desired mining result for the problem stated in 
Section 2. 
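
The extraction steps can be sketched as below. Several details are assumptions for illustration only: local support is taken to be the support over the patterns trained by the SSNN in question, global support the support over the whole data set, and a weight is treated as significant when it reaches `ratio` times the largest weight. The sketch also evaluates only one candidate rule per SSNN rather than iterating over attribute groups. All names are hypothetical.

```python
import numpy as np

def support(data, rule):
    """Fraction of rows of the 0/1 matrix `data` in which every attribute in `rule` is 1."""
    return float(np.mean(np.all(data[:, rule] == 1, axis=1)))

def extract_rules(ssnns, all_data, desired_support, eps=0.05, ratio=0.5):
    """ssnns is a list of (weights, trained_patterns); returns (rule, local, global) tuples."""
    char_rule_set = []                                   # step a
    for w, trained in ssnns:                             # steps b and c
        rule = [j for j, wj in enumerate(w) if wj >= ratio * w.max()]
        k = sum(w[j] for j in rule)                      # step d
        if abs(k - 1.0) > eps:                           # step e
            continue
        local = support(trained, rule)                   # step f
        glob = support(all_data, rule)
        if local >= desired_support and glob >= desired_support:
            char_rule_set.append((rule, local, glob))
    return char_rule_set                                 # step g

all_data = np.array([[1, 1, 0], [1, 1, 1], [1, 1, 0], [0, 1, 1]])
ssnns = [
    (np.array([0.5, 0.5, 0.0]), all_data[:3]),   # candidate rule: attributes 1 and 2
    (np.array([0.0, 0.5, 0.5]), all_data[3:]),   # candidate rule: attributes 2 and 3
]
rules = extract_rules(ssnns, all_data, desired_support=0.75)
```

The second candidate has full local support but a global support of only 0.5, so it is pruned, which is the effect the two support levels are meant to have.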



5. Experimental Results 

In order to test the algorithms developed, we conducted a series of experiments with 
a forensic data set of heart patients [16]. The data set was replicated up to a size 
of 537 records, and each example has seven attributes: age, sex, chest pain, blood 
pressure, cholesterol, sugar level in blood and heart beats. We used a binary 
encoding system to code each attribute for use. The encoding scheme is summarized in 
Table 1. For all the experiments we trained the SSNNs to produce target output 1. The 
training parameters were $\alpha = 0.001$, $\varepsilon = 0.0001$, 
max_iteration = 50000 and minimum support = 0.75. 



Input No.  Attribute       Coding rule                                           Symbol
1          Age             0 = age < 40, 1 = age >= 40                           a_1
2          Sex             0 = female, 1 = male                                  a_2
3          Chest pain      0 = few times, 1 = many times                         a_3
4          Blood pressure  0 = low, 1 = high                                     a_4
5          Cholesterol     0 = low, 1 = high                                     a_5
6          Sugar           0 = not significant in blood, 1 = significant in blood  a_6
7          Heart beats     0 = normal rate, 1 = fast rate                        a_7

Table 1: Coding of the attributes 



We first mined the characteristic rules for the heart patient data set using the 
algorithm discussed in Section 3. It was found that a single SSNN modeled only 33 
examples with weights w = [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0225, 0.1667] 
and a training error of 0.0001. From the weights we can construct the following rules: 

Rule 1: $a_1 \wedge a_2 \wedge a_3 \wedge a_4 \wedge a_5 \wedge a_7 \Rightarrow G$ (Support 0.27374) 

Rule 2: $a_1 \wedge a_2 \wedge a_3 \wedge a_4 \wedge a_5 \wedge a_6 \wedge a_7 \Rightarrow G$ (Support 0.04842) 

It is obvious that the above rules have a very low support level and are of no use 
in describing the example class. 

This problem was solved successfully using the IAMCR, which considered all the 
examples. We set the same parameters as stated before. The experiments were conducted 
in two ways. First, the experiment was done without using any clustering technique. 
We call the data set used in this experiment an unordered data set. The result of 
this experiment is reported in Table 2 and Appendix A, which list the SSNNs and the 
set of rules constructed from them with their support levels, respectively. The 
minimum support was set to 0.75 and the following rule was found: 

$$a_3 \wedge a_4 \wedge a_5 \Rightarrow G$$

i.e. chest pain, blood pressure and cholesterol are the attributes identified as the 
main cause of the heart disease. In other words, the characteristic rule can be 
formulated as follows: 
chest pain $\wedge$ blood pressure $\wedge$ cholesterol $\Rightarrow$ Heart patient 



SSNN#  w1      w2      w3      w4      w5      w6      w7      Examples  Error
1      0.1667  0.1667  0.1667  ?       0.1667  0.0225  ?       ?         ?
2      ?       0.2119  ?       ?       0.1763  ?       ?       ?         ?
3      ?       ?       ?       0.2000  0.2000  0.0196  ?       88        0.0001
4      ?       0.2000  ?       0.1017  0.2000  0.0076  ?       ?         ?
5      ?       0.2000  ?       0.2000  0.0000  0.0423  ?       16        ?
6      0.2000  0.2000  ?       0.2000  0.2000  0.0000  ?       4         0.0001
7      0.0000  0.2000  ?       0.2000  0.2000  0.0000  ?       ?         ?
8      0.0000  0.2500  ?       0.2500  0.2500  0.0000  ?       4         0.0001
9      ?       0.2500  ?       0.2500  ?       0.0000  ?       3         ?
10     0.2615  0.2431  0.2183  0.2771  ?       0.0320  ?       ?         ?
11     0.1817  0.1592  0.3347  0.4743  0.2072  0.0803  ?       ?         0.0001
12     ?       ?       0.2500  ?       0.2500  ?       ?       ?         ?
13     ?       ?       ?       0.2522  0.2522  0.0000  ?       ?         0.0001
14     ?       0.0000  ?       0.3333  0.1301  0.0000  ?       ?         ?
15     ?       0.2031  0.4118  0.3834  0.2047  ?       ?       ?         ?
16     ?       0.2211  0.3750  0.3467  ?       0.0668  ?       18        0.0001
17     0.1435  0.1125  0.3333  0.3333  ?       ?       ?       9         ?
18     0.3971  0.0000  ?       0.1915  0.2057  ?       ?       4         0.0001
19     0.3333  0.0000  ?       0.3333  0.0000  0.0000  ?       1         ?

(? = entry illegible in the source) 

Table 2. Results of IAMCR using unordered data sets. 

Second, two experiments were done to investigate the effect of clustering on 
mining. The first experiment used the reasonably ordered data set formed from a 2x2 
Kohonen map. The clustering result of this experiment is reported in Table 3. After 
training, the quantization error was found to be 0.652, which refers to the average 
difference between the example features and the weights of the cluster centroid. 333 
examples were clustered in one Kohonen neuron and the remaining 204 examples in the 
other neuron. The examples were ordered so that the 333 examples from the first 
neuron came first, followed by the 204 examples from the second neuron. We call this 
data set a reasonably ordered data set, as the clustering was done in a Kohonen map 
of restricted size. The result using the IAMCR with this reasonably ordered data is 
reported in Table 4 and Appendix B. The mining result is the same. 
Exp. No.  No. of clusters  Quantization Error  Cluster size in order
1         2                0.65236210          333, 204
2         22               0.01686998          168, 118, 61, 33, 32, 21, 20, 19, 12, 11, 8, 6, 6, 4, 3, 2, 2, 2, 1, 1, 1

Table 3. Results of clustering. 



The second experiment was done similarly, except that the size of the Kohonen map 
was increased to 50x50 so that we form an almost ordered data set from the groups. 
The result of this Kohonen clustering is shown as Experiment 2 in Table 3. It has a 
very small quantization error, indicating that it clustered the data well. The result 
of the experiment is reported in Table 5 and Appendix C. Nevertheless, we have the 
same mining result as before. The only difference is in the number of SSNNs used in 
the experiments and the intermediate rules produced by them. From Tables 2, 4 and 5, 
one can conclude that the almost ordered data set requires fewer SSNNs and hence 
the rule mining process is faster. 



SSNN#  w1      w2      w3      w4      w5      w6      w7      Examples  Error
1      0.1667  0.1667  0.1667  0.1667  0.1667  0.0180  0.1667  147       ?
2      ?       0.2000  0.2000  0.2000  0.2000  0.0070  0.2000  ?         ?
3      ?       0.0000  0.2000  0.2000  0.2000  0.0357  0.2000  ?         0.0001
4      ?       0.0000  0.2500  0.2500  0.2500  0.0039  0.2500  56        ?
5      ?       0.2000  0.2000  0.2000  0.2000  0.0064  0.0000  53        ?
6      ?       0.2500  0.2500  0.2500  0.2500  0.0000  0.0000  ?         ?
7      0.2500  0.1143  0.2500  0.2500  0.0000  0.0226  0.2500  ?         0.0001
8      ?       0.0369  0.3333  0.3333  0.0000  0.0000  0.3333  23        ?
9      ?       ?       ?       0.2545  ?       0.0586  0.2545  ?         ?
10     0.1955  ?       ?       0.2576  0.2560  0.0274  0.3244  ?         0.0001
11     0.2500  0.2500  ?       0.2500  ?       0.0000  ?       ?         ?
12     ?       0.3333  0.3333  0.3333  ?       0.0000  0.0000  5         0.0001
13     ?       0.0000  0.2500  0.2500  ?       0.2500  0.0000  ?         ?
14     ?       0.0000  0.2500  0.2500  ?       0.0000  0.0000  5         ?
15     ?       0.0946  0.2217  0.1674  0.3891  0.0000  0.0000  3         ?
16     ?       0.0000  0.3333  0.3333  0.3333  ?       ?       1         ?
17     ?       0.0000  ?       ?       0.0000  0.0000  0.0000  2         0.0001
18     ?       0.0000  0.5000  0.0000  0.0000  0.0000  0.5000  1         ?

(? = entry illegible in the source) 

Table 4. Results of IAMCR using reasonably ordered data sets. 











SSNN#  w1      w2      w3      w4      w5      w6      w7      Examples  Error
1      ?       ?       0.2500  0.2593  ?       0.0144  ?       262       ?
2      ?       ?       ?       ?       ?       0.0337  ?       136       ?
3      ?       0.1365  0.2399  0.2419  ?       0.0876  ?       ?         0.0001
4      0.3163  ?       ?       0.2833  ?       0.0572  0.2833  ?         ?
5      0.2962  0.0000  0.2962  0.2962  ?       0.1386  0.1115  ?         ?
6      ?       0.1845  0.2233  0.4077  0.2233  0.0000  0.4077  ?         ?
7      ?       0.0000  ?       0.4968  ?       0.0000  0.3135  ?         ?
8      ?       0.0000  0.3333  0.3333  ?       0.0000  0.3333  ?         ?
9      ?       0.0000  0.3333  0.0000  ?       0.0000  0.0000  1         ?
10     ?       ?       0.5000  0.0000  ?       0.0000  0.5000  1         ?

(? = entry illegible in the source) 

Table 5. Results of IAMCR using almost ordered data sets. 
6. Conclusion 

In this study, we have shown how to use single layered supervised neural 
networks to form an effective model for mining characteristic rules, so that 
the learning speed is much improved. Multi layer supervised neural networks 
based on backpropagation are very powerful in solving many complex problems, 
but they have not yet been widely used in data mining, partly due to their time 
consuming training process. The algorithms proposed in this paper do not suffer 
from this weakness. 
Appendix A: Rules by IAMCR Using the Unordered Data Set 

Rule No  Rule Description    Produced By  Local Support  Global Support
1        a1^a2^a3^a4^a5^a7   SSNN#1       1.00000        0.27374
2        ?                   ?            ?              0.31099
3        a1^a2^a4^a5^a7      SSNN#2       0.89474        0.29795
4        a2^a3^a4^a5^a7      SSNN#2       0.77193        0.36499
5        ?                   ?            1.00000        0.36499
6        a1^a2^a3^a5^a7      SSNN#4       1.00000        0.28864
7        a1^a2^a3^a4^a7      SSNN#5       ?              ?
8        a1^a2^a4^a5^a7      SSNN#6       1.00000        0.29795
9        a2^a3^a4^a5^a7      SSNN#7       1.00000        0.36499
10       a2^a4^a5^a7         SSNN#8       1.00000        0.39665
11       a2^a3^a4^a7         SSNN#9       1.00000        0.40782
12       a1^a2^a3^a4         SSNN#10      0.76667        0.43762
13       a3^a4^a5^a7         SSNN#12      1.00000        0.62384
14       a1^a3^a4^a5         SSNN#13      0.97368        0.54749
15       a1^a3^a4^a7         SSNN#13      0.94737        0.48976
16       a1^a3^a5^a7         SSNN#13      0.94737        0.44320
17       a1^a4^a5^a7         ?            0.97368        0.45996
18       a3^a4^a5^a7         SSNN#13      0.94737        0.62384
19       a3^a4^a7            SSNN#14      1.00000        0.72812
20       a2^a3^a4            ?            0.77778        0.56238
21       a3^a4^a5            SSNN#17      1.00000        0.76350
22       a1^a3^a4            SSNN#19      1.00000        0.66108

(? = entry illegible in the source) 
Appendix B: Rules by IAMCR Using the Reasonably Ordered Data Set 

Rule No  Rule Description    Produced By  Local Support  Global Support
1        a1^a2^a3^a4^a5^a7   SSNN#1       1.00000        0.27374
2        a2^a3^a4^a5^a7      ?            1.00000        0.36499
3        a1^a3^a4^a5^a7      ?            1.00000        0.42831
4        a3^a4^a5^a7         SSNN#4       1.00000        0.62384
5        a1^a2^a3^a4^a5      SSNN#5       ?              ?
6        a2^a3^a4^a5         SSNN#6       1.00000        0.48231
7        a1^a3^a4^a7         ?            1.00000        0.48976
8        a3^a4^a7            ?            ?              0.72812
9        a1^a2^a4^a5         SSNN#9       0.76923        0.39851
10       a1^a2^a4^a7         SSNN#9       0.92308        0.33892
11       a1^a2^a5^a7         SSNN#9       0.76923        0.31285
12       a1^a4^a5^a7         SSNN#9       0.84615        0.45996
13       a2^a4^a5^a7         SSNN#9       0.76923        0.39665
14       a1^a2^a3^a4         SSNN#11      1.00000        0.43762
15       a2^a3^a4            SSNN#12      1.00000        0.56238
16       a1^a3^a4^a6         SSNN#13      ?              0.12849
17       a1^a3^a4^a5         SSNN#14      1.00000        0.54749
18       a3^a4^a5            SSNN#16      1.00000        0.76350
19       a1^a3^a4            SSNN#17      ?              ?
20       a3^a7               SSNN#18      1.00000        0.74488

(? = entry illegible in the source) 

Appendix C: Rules by IAMCR Using the Ordered Data Set 

Rule No  Rule Description  Produced By  Local Support  Global Support
1        a1^a2^a3^a4       SSNN#1       0.87405        0.43762
2        a1^a2^a4^a5       SSNN#1       0.77099        0.39851
3        a1^a3^a4^a5       ?            0.75191        0.54749
4        a2^a3^a4^a5       ?            0.82443        0.48231
5        a3^a4^a5          SSNN#2       0.91176        0.76350
6        a3^a4^a7          SSNN#2       0.81618        0.72812
7        a3^a4^a7          SSNN#6       ?              ?
8        a4^a5^a7          SSNN#6       0.93750        0.66294
9        a3^a4^a7          SSNN#8       1.00000        0.72812
10       a1^a3^a5          SSNN#9       1.00000        0.56611
11       a3^a7             SSNN#10      1.00000        0.74488

(? = entry illegible in the source) 

Indicates mined characteristic rules. 
Abstract In this paper, a fast adaptive neural regression estimator named 
FANRE is proposed. FANRE exploits the advantages of both Adaptive 
Resonance Theory and Field Theory while addressing the characteristics of 
regression problems. It achieves not only impressive approximation results but 
also fast learning speed. Besides, FANRE has incremental learning ability. 
When new instances are fed, it does not need to retrain on the whole training set. 
Instead, it can learn the knowledge encoded in those instances by slightly 
adjusting the network topology when necessary. This characteristic enables 
FANRE to work on real-time online learning tasks. Experiments including 
approximating a line, a sine and a 2-d Mexican Hat show that FANRE is superior to 
the BP-type algorithms most often used in regression estimation in both 
approximation quality and training time cost. 



1. Introduction 

Adaptive Resonance Theory (ART) [1] is an important family of competitive neural 
learning models. Its memory mode is very similar to the biological one, and its 
memory capacity can increase as the number of learning patterns increases. It can 
perform real-time online learning, and can work in a nonstationary world. Field 
Theory [2] is named from CPM (Coulomb Potential Model) [3]. It can perform real-time 
one-pass supervised learning with fast speed, and no spurious responses are produced 
regardless of the number of memories stored in the network. We have proposed a 
neural network classifier based on ART and Field Theory, which achieved better 
results than several other neural algorithms [4]. 

Many regression problems occur in the financial, decision and automation fields, 
such as the automatic generation of stock price curves and the prediction of the 
moving direction and extent of a manipulator. Solving those problems well will not 
only bring great economic benefit but also accelerate technical progress in those 
fields. However, although the neural algorithms designed to solve classification 
tasks have been deeply studied, research on neural regression estimators is 
lacking. The output components of classification tasks are discrete, but those of 
regression tasks are continuous. Because of this discrete nature, classification 
algorithms cannot produce smooth approximations when applied to regression 
problems. Thus, devising effective neural regression estimators has become an 
urgent matter. 

N. Foo (Ed.): AI'99, LNAI 1747, pp. 48-59, 1999. 
In this paper, a neural regression estimator, FANRE, is proposed, which exploits 
the advantages of both ART and Field Theory while addressing the characteristics of 
regression problems. FANRE needs only one pass of learning, and achieves not only 
impressive approximation quality but also fast learning speed. The learning of 
FANRE is performed in an incremental style. When new instances are fed, it does not 
retrain on the whole training set as most feed-forward algorithms do. Instead, it 
can learn the knowledge encoded in the instances by slightly adjusting its topology, 
that is, by adaptively appending one or two hidden units and some connections to the 
existing network when necessary. Moreover, since the network architecture of 
FANRE is set up adaptively, the disadvantage of manually determining the number of 
hidden units, shared by most feed-forward networks, is overcome. Experimental 
results show that FANRE is superior to the BP-type algorithms most often used in 
regression estimation in both approximation quality and training time cost. 

The rest of this paper is organized as follows. In Section 2, we describe and 
analyze the FANRE algorithm in detail; in Section 3, we report experimental results 
and comparisons on three function approximation problems against a BP-type 
algorithm; finally, in Section 4, we conclude and indicate some directions for 
future work. 



2. FANRE Algorithm 

In feed-forward neural networks, the hidden units in a single hidden layer 
architecture are often too tightly coupled: an improvement of the approximation at 
some points tends to result in a deterioration at other points. By comparison, the 
hidden units in a two hidden layer architecture are relatively loosely coupled, and 
the corresponding sub-regions can be adjusted independently. This enables the 
latter architecture to achieve better results than the former on regression tasks. 
For this reason, FANRE adopts two hidden layers, which perform internal 
approximation corresponding to the input and output patterns respectively. 
Fig. 1 shows its architecture. 

Except the connections between the first and second layer units, all connections of 
FANRE are bi-directional. The feedback connections, whose function is just 
transmitting feedback signal to implement resonance, are always set to 1.0. 

The initial network is composed of only input and output layers, whose unit 
numbers are respectively set to the number of components of the input and output 
patterns. In particular, the number of units in the hidden layers is zero. This is 
different from some other neural algorithms that configure hidden units before the 
start of the learning course. When new instances are fed, FANRE adaptively appends 
hidden units and connections so that the knowledge encoded in those instances can be 
learned. The unit-appending process terminates after all the instances are fed. Thus, 
the topology of FANRE is always adaptively changing during the learning course. 

[Figure: four layers of units, labelled first layer units (input units), second 
layer units, third layer units (hidden units) and fourth layer units (output units); 
the Gaussian weights are marked.] 
Fig. 1. The architecture of FANRE 

When the first instance is fed, FANRE appends two hidden units to the network, 
one in the second layer and the other in the third layer. Those two units are connected 
with each other. The feed forward and feedback connections between them are all set 
to 1.0. The third layer unit is connected with all the output units, the feed forward 
connections are respectively set to the output components of current instance, the 
feedback connections are all set to 1.0. The second layer unit is connected with all the 
input units through Gaussian weights. The response-centers are respectively set to the 
input components of current instance, and the response-characteristic-widths are set to 
a default value. 

FANRE introduces the notion of attracting basin, which is proposed in Field 
Theory [2]. Each second layer unit of FANRE defines an attracting basin by response- 
centers and response-characteristic-widths of Gaussian weights connecting with it. 
Thus, FANRE constructs its first attracting basin according to the first instance. And 
it will add or move basins according to the later instances. 

Assume that the instances fed to the input units are 
$A_k = (a_1^k, a_2^k, \ldots, a_n^k)$ $(k = 1, 2, \ldots, m)$, where k is the index of 
the instance and n is the number of input units. The value input to the second layer 
unit j from the first layer unit i is: 

$$bIn_{ij} = e^{-\left(\frac{a_i^k - \theta_{ij}}{\alpha_{ij}}\right)^2} \qquad (1)$$

where $\theta_{ij}$ and $\alpha_{ij}$ are respectively the response-center and the 
response-characteristic-width of the Gaussian weight connecting unit i with unit j. 

Since the dynamical property of a Gaussian weight is entirely determined by its 
response-center and response-characteristic-width, learned knowledge can be encoded 
in the weight by modifying only those two parameters. Thus, during the training 
process, if the input pattern is near an existing attracting basin that is determined 
by the $\theta_{ij}$'s and $\alpha_{ij}$'s of Equation 1, the basin is slightly 
adjusted so that it covers the input pattern. Otherwise a new basin is established, 
whose $\theta_{ij}$'s and $\alpha_{ij}$'s ensure that the input pattern is covered. 
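
The role of the two parameters can be illustrated with a small sketch. It assumes a Gaussian response of the form $e^{-((a - \theta)/\alpha)^2}$, and the coverage test with its `threshold` is a hypothetical criterion introduced only for this example; the names `gaussian_input` and `covers` are likewise not from the paper.

```python
import math

def gaussian_input(a, theta, alpha):
    """Response of one Gaussian weight: 1 at the centre, falling off with distance."""
    return math.exp(-((a - theta) / alpha) ** 2)

def covers(pattern, centres, widths, threshold=0.5):
    """Hypothetical coverage test: every input component responds above `threshold`."""
    return all(gaussian_input(a, t, w) >= threshold
               for a, t, w in zip(pattern, centres, widths))

centres = [0.32, 0.50]      # response-centers of one attracting basin
widths = [0.10, 0.10]       # response-characteristic-widths
near = covers([0.30, 0.50], centres, widths)
far = covers([0.90, 0.50], centres, widths)
```

Moving a response-center shifts the basin, while enlarging a response-characteristic-width widens it, which is how a basin can be adjusted to cover a nearby pattern.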

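The second layer unit j computes its activation value according to Equation 2: 

$$b_j = f\left(\sum_{i=1}^{n} bIn_{ij} - \theta_j\right) \qquad (2)$$

where $\theta_j$ is the bias of unit j and f is the Sigmoid function shown in 
Equation 3: 

$$f(u) = \frac{1}{1 + e^{-u}} \qquad (3)$$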
A leakage competition* is carried out among all the second layer units. The outputs 
of the winners are transferred to the related third layer units. The activation value 
of the third layer unit h is computed according to Equation 4: 

$$c_h = f\left(\sum_{j=1}^{o} b_j v_{jh} - \theta_h\right) \qquad (4)$$

where $b_j$ is the activation value of the second layer unit j, which is both a 
winner in its competition and connected with unit h; o is the number of second layer 
winners connected with unit h; and $v_{jh}$ is the feed forward weight connecting 
unit j to unit h. Note that $v_{jh}$ is always 1.0. $\theta_h$ is the bias of unit h, 
and f is again the Sigmoid function. 

Then, a leakage competition is carried out among all the third layer units that 
receive inputs from the second layer units. Note that there may exist some units 
that are not qualified to attend the competition. The reason is that the attracting 
basins determined by the second layer units connected with them are far from the 
current instance, and there is no hope of covering the instance by only slightly 
adjusting those basins. In other words, a third layer unit connected with no 
second layer winner is not qualified for the third layer competition. 
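
A leakage competition, in which every unit whose activation exceeds a threshold wins, and the qualification rule for third layer units can be sketched as follows. The function names, the `links` data layout and the threshold value are hypothetical and serve only to illustrate the two selection rules.

```python
def leakage_competition(activations, threshold):
    """Indices of all units whose activation exceeds `threshold` (none, one or several)."""
    return [j for j, b in enumerate(activations) if b > threshold]

def qualified_third_layer(links, second_winners):
    """links[h] lists the second layer units connected with third layer unit h.

    A third layer unit is qualified only if at least one of them is a winner.
    """
    winners = set(second_winners)
    return [h for h, js in enumerate(links) if winners.intersection(js)]

second_winners = leakage_competition([0.20, 0.80, 0.75, 0.10], threshold=0.7)
qualified = qualified_third_layer([[0], [1, 2], [3]], second_winners)
```

Unlike a winner-take-all scheme, this competition can yield several winners at once, so several attracting basins near the current instance may contribute to the output.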

The activation values of the third layer winners are enlarged by a factor of $N$ and
transferred to the output units. The activation value of the output unit $l$ is computed
according to Equation 5:

$$d_l = \frac{1}{q}\sum_{h=1}^{q} N c_h w_{hl} \qquad (5)$$

where $d_l$ is the activation value of the output unit $l$; $c_h$ is the activation value
of the third layer unit $h$, which is a winner in its competition; $w_{hl}$ is the feed
forward weight connecting unit $h$ to unit $l$; and $q$ is the number of third layer
winners.

The reason we enlarge the activation values of the third layer winners by a factor
of $N$ is that the value of $c_h$, which is obtained through two successive sigmoid
transformations, is relatively small. If it were sent directly to the output units, the
difference between the actual network output and the expected output would be quite
large, which would cause hidden units to be appended unnecessarily. Experiments show
that $N=10$ generates satisfactory results.

If the network is not in training, the approximation result is obtained from Equation
5. Otherwise, three pre-set parameters are used, namely the maximum allowable error
$Err_{max}$, the first-degree vigilance $Vig_1$ and the second-degree vigilance $Vig_2$.
$Vig_1$ and $Vig_2$ are used to control the unit-appending process. These parameters
satisfy Equation 6:

$$Err_{max} < Vig_1 < Vig_2 \qquad (6)$$

The error between the actual network output and the expected output is then computed.
We use the average squared error as the measure, as shown in Equation 7:

$$Err = \frac{1}{n}\sum_{l=1}^{n}\left(d_l - \hat{d}_l\right)^2 \qquad (7)$$

where $n$ is the number of output units, $d_l$ is the actual output of the output unit
$l$, and $\hat{d}_l$ is its expected output.

* In leakage competition, if the activation value of a unit is greater than a certain
threshold, it is a winner. So there may be more than one winner at the same time.
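As a small worked example of the error measure, the sketch below computes the average squared error over the output units. The function name and the sample outputs are illustrative assumptions, not values from the paper.

```python
def average_squared_error(actual, expected):
    """Mean of the squared differences between actual and expected outputs."""
    if len(actual) != len(expected) or not actual:
        raise ValueError("actual and expected must be non-empty and of equal length")
    n = len(actual)
    return sum((d - e) ** 2 for d, e in zip(actual, expected)) / n


# Two hypothetical output units: (0.1^2 + 0.2^2) / 2 = 0.025
err = average_squared_error([0.5, 0.8], [0.4, 1.0])
```

An error of 0.025 would fall below the maximum allowable error of 0.03 used later in the experiments, so no adjustment would be triggered for such an instance.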

If the error $Err$ is less than $Err_{max}$, an existing attracting basin covers the
current instance and the approximation is satisfactory. No adjustment is necessary in
this situation.

If $Err$ exceeds $Err_{max}$, the current instance is not covered by any existing
attracting basin, and the topology-adjusting phase begins. In this phase, we must find
out whether the current instance can be covered by slightly adjusting some basins, or
whether it is necessary to construct a new basin for the instance. Furthermore, in the
latter case, we must find out whether we can exploit the existing internal output
approximation represented by the existing third layer units.

If $Err$ exceeds $Err_{max}$ but is less than $Vig_1$, the overall approximating
performance is not satisfactory, but the internal approximations of the current input
and output patterns are still valid to some extent. In this situation, an existing
basin can cover the current instance through fine adjustment alone. Thus, the third
layer unit $u$ that has the maximum activation value is found according to Equation 8:

$$c_u = \max_{1 \le h \le q}\,(c_h) \qquad (8)$$

where $c_h$ is the third layer unit activation value computed according to
Equation 4, and $q$ is the number of third layer winners.

Unit $u$ releases a stimulus signal and feeds it back through the feedback connection
to the second layer unit $t$ that has the maximum activation value, which satisfies
Equation 9:

$$b_t = \max_{1 \le j \le o}\,(b_j) \qquad (9)$$

where $b_j$ is the second layer unit activation value computed according to
Equation 2, and $o$ is the number of second layer winners connected with unit $u$.

The response-centers and response-characteristic-widths of the Gaussian weights
connected with unit $t$ are repeatedly adjusted according to Equations 10 and 11 until
$Err$ is less than $Err_{max}$. The effect of Equation 10 is to move the center of the
attracting basin toward the current instance, and the effect of Equation 11 is to extend
the boundary of the basin toward the current instance.




(10)

Note that the adjustment of the attracting basin is a resonant process, in which the
second layer and the third layer of FANRE correspond respectively to the feature
representation field and the category representation field of ART1 [5]. The analog of
ART1’s top-down learned expectation in FANRE is the expected activation value of the
third layer unit $u$, and the analog of ART1’s bottom-up information is the feed
forward value of the second layer unit $t$.

At the beginning of the adjustment, the attracting basin corresponding to unit $t$ is
adjusted according to Equations 10 and 11. After that, unit $t$ transfers its activation
value to unit $u$. Simultaneously, the output units provide unit $u$ with an expected
activation value that would make $Err$ less than $Err_{max}$. If the actual activation
value of unit $u$ does not match the expected value, unit $u$ releases a signal and
feeds it back to unit $t$ through the feedback connection. Then unit $t$ performs another
adjustment, and the resonance occurs. This resonant process terminates only when the
actual activation value of unit $u$ matches its expected value, that is, when the
bottom-up information matches the top-down learned expectation.

The adjustment of the attracting basin described above involves not only feedback
signals but also iterative modulation. However, since the input pattern moves closer
and closer to the selected attracting basin as the adjustment continues, the resonance
is bound to stabilize at a point where the input pattern is covered by the basin.
This stabilization property is an advantage that FANRE inherits from Adaptive
Resonance Theory.

If $Err$ exceeds $Vig_1$ but is less than $Vig_2$, the internal output approximation
represented by unit $u$ is applicable to the current instance, but the internal input
approximations represented by the existing second layer units are unfit for it. Thus,
a new unit is appended to the second layer. It is connected with unit $u$ and with all
the input units. The feed forward and feedback connections between the new unit and
unit $u$ are all set to 1.0. The response-centers of the Gaussian weights connected with
the new unit are set to the corresponding input components of the current instance,
and the response-characteristic-widths are set to a default value. If $Err$ still exceeds
$Err_{max}$ after the appending, the basin deviates somewhat from its typical attractor.
The basin is then moved according to Equation 12 until $Err$ is less than $Err_{max}$ or
a pre-set maximum number of moving steps $r$ is reached.

$$0 < \delta < 1 \qquad (12)$$

where $\delta$ is the response-center moving step. Experiments show that $\delta=0.1$
and $r=2$ achieve satisfactory results.

If $Err$ exceeds $Vig_2$, both the internal input approximation and the internal output
approximation represented by the existing hidden units are unfit for the current
instance. Thus, two units are appended to the hidden layers, one in the second
layer and the other in the third layer. The new second layer unit is connected with all
the input units. The response-centers of its Gaussian weights are set to the
corresponding input components of the current instance, and the
response-characteristic-widths are set to a default value. The new third layer unit is
connected with all the output units. Its feed forward connections are set to the
corresponding output components of the current instance, and its feedback connections
are all set to 1.0. In addition, the two new units are connected with each other, and
both the feed forward and feedback connections between them are set to 1.0. If $Err$
still exceeds $Err_{max}$ after the appending, the corresponding attracting basin is
moved according to Equation 12 until $Err$ is less than $Err_{max}$ or a pre-set maximum
number of moving steps $r$ is reached.
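The four cases above amount to comparing the error with three thresholds. The sketch below is one way to express that decision, assuming the ordering of Equation 6; the function name and the returned labels are hypothetical, and the treatment of exact equality with a threshold is an assumption, since the text only speaks of the error being "less than" or "beyond" each value.

```python
def choose_action(err, err_max, vig1, vig2):
    """Map an error value to the topology-adjusting case it falls into."""
    if not (err_max < vig1 < vig2):
        raise ValueError("parameters must satisfy err_max < vig1 < vig2")
    if err < err_max:
        return "no_adjustment"                      # an existing basin covers the instance
    if err < vig1:
        return "adjust_existing_basin"              # resonant fine adjustment of unit t
    if err < vig2:
        return "append_second_layer_unit"           # reuse third layer unit u
    return "append_second_and_third_layer_units"    # build a new basin and output mapping


# Thresholds as used in Section 3.1: Err_max = 0.03, Vig_1 = 0.04, Vig_2 = 0.05.
actions = [choose_action(e, 0.03, 0.04, 0.05) for e in (0.01, 0.035, 0.045, 0.2)]
```

Each call handles a single instance; in the learning course this check would be repeated once for every instance fed to the network.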

This completes the process of learning one instance fed to FANRE. If there are
instances that have not yet been fed, FANRE proceeds to the next one; otherwise the
learning course terminates. Thus the learning of FANRE is performed incrementally, and
the instances are fed in only one pass. Figure 2 shows the flowchart of the learning
course.




Fig. 2. Flowchart of the learning course of FANRE

3. Experimental Results and Comparisons

3.1 Methodology 

We performed experiments to compare FANRE against a BP-type algorithm on three
regression problems of ascending difficulty. The machine used is a Pentium MMX
200MHz with 32M RAM. The parameters of FANRE are set as follows. The default
response-characteristic-width $\alpha_{ij}$ of the Gaussian weights is set to 0.1.
The bias of the hidden layer units is set to 0.01. The threshold of the leakage
competition of the hidden units is set to 0.6. The response-center moving step $\delta$
is set to 0.1, and the maximum moving step $r$ is set to 2. The maximum allowable error
$Err_{max}$ is set to 0.03. The first-degree vigilance $Vig_1$ is set to 0.04, and the
second-degree vigilance $Vig_2$ is set to 0.05.

Since Backpropagation is currently the most prevalent neural algorithm for regression
tasks, an algorithm of this kind is tested and compared against FANRE. The algorithm
we selected is SuperSAB [6], a faster variation of Backpropagation. Tollenaere [6]
reported that it is 10 to 100 times faster than standard BP [7]. In our experiments,
the weight step $\eta_{ij}$ of SuperSAB is set to 10; the weight increase factor
$\eta_{up}$ and the weight decrease factor $\eta_{down}$ are set to 1.05 and 0.2
respectively. In order to avoid overfitting, the training process is terminated after
500 epochs.
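To make the role of the increase and decrease factors concrete, the sketch below shows a simplified sign-based step-size rule in the spirit of SuperSAB. It is an assumption-laden illustration: it omits momentum and the undoing of the previous weight change, the function name `adapt_step` is hypothetical, and it is not claimed to be the exact procedure of [6].

```python
def adapt_step(step, grad, prev_grad, eta_up=1.05, eta_down=0.2):
    """Adapt one weight's step size from the signs of successive gradients.

    Same sign: the step grows by eta_up. Sign change: the step shrinks by
    eta_down. A zero gradient leaves the step unchanged (assumption).
    """
    product = grad * prev_grad
    if product > 0:
        return step * eta_up
    if product < 0:
        return step * eta_down
    return step


# Starting from the step of 10 used in the experiments.
grown = adapt_step(10.0, grad=0.3, prev_grad=0.1)
shrunk = adapt_step(10.0, grad=-0.3, prev_grad=0.1)
```

With the factors quoted above, a consistent gradient sign raises the step from 10 to 10.5, while a sign change cuts it to 2.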

The measure used in the comparison is the classical regression measure [8], that is,
variance, which represents the average squared distance between the actual point and
the expected point. The mathematical form of variance is the same as Equation 7,
where $d_l$ denotes the actual output and $\hat{d}_l$ denotes the expected output. The
value range of variance is [0, 1], and the smaller the variance, the better the
approximation.
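Because the variance is a mean of squared distances, its square root is a root-mean-square (RMS) distance on the same scale as the outputs, and the mean absolute distance can never exceed that value. The sketch below shows the conversion for two variance bounds of the kind quoted in Section 3.2; the helper name is illustrative.

```python
import math


def rms_distance(variance):
    """Square root of a mean squared distance (root-mean-square distance)."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    return math.sqrt(variance)


bound_small = rms_distance(0.0005)   # roughly 0.0224
bound_large = rms_distance(0.0015)   # roughly 0.0387
```

These values explain why a variance below 0.0005 corresponds to a distance below 0.023, and a variance below 0.0015 to a distance below 0.039.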



3.2 Approximating Line 

In this experiment, we apply both FANRE and SuperSAB to approximate three lines,
namely y=x, y=0.5x and y=0.3x, for $x \in [0, 1]$. For each line, we uniformly sample
200 points, of which 50 points compose the training set and all 200 points
compose the test set. Table 1 shows the experimental results.





Algorithm   Func     Training set variance   Test set variance   Training time (second)

FANRE       y=x      0.000190                0.000278            0.020
FANRE       y=0.5x   0.000447                0.000492            0.020
FANRE       y=0.3x   0.000461                0.000465            0.020
SuperSAB    y=x      0.001330                0.001459            2,542
SuperSAB    y=0.5x   0.001344                0.001469            2,542
SuperSAB    y=0.3x   0.001283                0.001406            2,542

Table 1. Experimental results of line approximation

From Table 1 we can see that both the test set variance and the training set variance
of FANRE are less than 0.0005, which means the root-mean-square distance between
the actual point and the expected point is less than 0.023. Both the test set variance
and the training set variance of SuperSAB are less than 0.0015, which means the
root-mean-square distance between the actual point and the expected point is less than
0.039. Thus, FANRE approximates slightly better than SuperSAB, although both
algorithms achieve satisfactory results. Moreover, the training time of FANRE is
about five orders of magnitude less than that of SuperSAB. So we conclude that FANRE
is superior to SuperSAB on this problem. Fig. 3 shows the approximating results on
y=x using 200 test instances, that is, the lines are depicted with 200 points.

Fig. 3. Comparison of the approximating results on y=x for $x \in [0, 1]$, using 200
test points: a) prototype of line function; b) approximating result of FANRE;
c) approximating result of SuperSAB



3.3 Approximating Sine 

In this experiment, we apply both FANRE and SuperSAB to approximate the sine
function y=sin(x), for $x \in [0, 2\pi]$. We uniformly sample 300 points from the curve
and construct 5 training sets comprising 50, 100, 150, 200 and 250 points respectively,
where each smaller set is a subset of the next bigger one. All 300 points compose the
test set. Table 2 shows the experimental results.
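The nested training sets described above can be built as in the following sketch. How the paper selects the members of each training set is not stated, so drawing them as prefixes of one fixed random ordering is an assumption, as are the function name, the seed, and the inclusion of both interval endpoints in the sample.

```python
import math
import random


def nested_training_sets(points, sizes, seed=0):
    """Return training sets of the given sizes, each a subset of the next.

    Prefixes of a single shuffled ordering are nested by construction.
    """
    order = list(points)
    random.Random(seed).shuffle(order)
    return [order[:k] for k in sorted(sizes)]


# 300 uniformly spaced points on [0, 2*pi]; all of them form the test set.
xs = [2 * math.pi * i / 299 for i in range(300)]
test_set = [(x, math.sin(x)) for x in xs]
training_sets = nested_training_sets(test_set, [50, 100, 150, 200, 250])
```

Since every training set is drawn from the test set, the test variance here measures how well the network interpolates between the points it was trained on.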





Algorithm   Training set size   Training set variance   Test set variance   Training time (second)

FANRE       50                  0.000000                0.000244
FANRE       100                 0.000167                0.000312
FANRE       150                 0.000139                0.000147
FANRE       200                 0.000201                0.000224
FANRE       250                 0.000239                0.000241
SuperSAB    50                  0.031475                0.031671            2,536
SuperSAB    100                 0.026842                0.026926            5,045
SuperSAB    150                 0.027568                0.027606            7,547
SuperSAB    200                 0.027089                0.027115            10,053
SuperSAB    250                 0.027019                0.027038            12,563

Table 2. Experimental results of Sine approximation

From Table 2 we can see that FANRE achieves better results than SuperSAB whatever
the size of the training set. Both the test set variance and the training set
variance of FANRE are less than 0.0025, which means the root-mean-square distance
between the actual point and the expected point is less than 0.05. Both the test set
variance and the training set variance of SuperSAB are greater than 0.026, which
means the root-mean-square distance between the actual point and the expected point is
greater than 0.162. Moreover, the training time of FANRE is always about five orders
of magnitude less than that of SuperSAB. So we conclude that FANRE is superior to
SuperSAB on this problem.

Note that the relatively worse result of SuperSAB is partially due to the training
strategy we used, that is, we terminate the training course after 500 epochs in order
to avoid overfitting. We believe that if training were continued and overfitting were
skillfully avoided, SuperSAB could attain better results. However, even if the
approximating results of SuperSAB could be improved to the same level as those of
FANRE, we would still regard the latter as superior because its training time is far
less than that of the former. Fig. 4 shows the approximating results using 250
training instances and 300 test instances, that is, the curves are depicted with 300
points.




Fig. 4. Comparison of the approximating results on y=sin(x) for $x \in [0, 2\pi]$,
using 300 test points: a) prototype of Sine function; b) approximating result of
FANRE; c) approximating result of SuperSAB



3.4 Approximating 2-d Mexican Hat 



In this experiment, we apply both FANRE and SuperSAB to approximate the 2-d
Mexican Hat function shown in Equation 13, for $x \in [-4\pi, 4\pi]$.

$$y = \mathrm{sinc}|x| = \frac{\sin|x|}{|x|} \qquad (13)$$
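A minimal sketch of the target function follows, assuming the usual convention that sinc takes its limiting value 1 at x = 0 and that the sample includes both interval endpoints; the function and variable names are illustrative.

```python
import math


def mexican_hat(x):
    """2-d Mexican Hat curve y = sin|x| / |x|, with the limit 1.0 at x = 0."""
    ax = abs(x)
    if ax == 0.0:
        return 1.0
    return math.sin(ax) / ax


# 1,000 uniformly spaced points on [-4*pi, 4*pi].
n = 1000
xs = [-4 * math.pi + 8 * math.pi * i / (n - 1) for i in range(n)]
points = [(x, mexican_hat(x)) for x in xs]
```

The curve is symmetric about x = 0 and crosses zero at every nonzero multiple of pi, which gives it several oscillations of decreasing amplitude over the sampled interval.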



We uniformly sample 1,000 points from the curve and construct 5 training sets
comprising 100, 200, 400, 600 and 800 points respectively, where each smaller set is a
subset of the next bigger one. All 1,000 points compose the test set. Table 3 shows
the experimental results.

From Table 3 we can see that approximating the 2-d Mexican Hat is a somewhat
difficult problem on which SuperSAB performs poorly. Its test set variance is
always greater than 0.125, which means the root-mean-square distance between the
actual point and the expected point is greater than 0.354, whatever the size of the
training set. FANRE, however, still achieves satisfactory results on this problem. Its
test set variance is always less than 0.0043, which means the root-mean-square
distance between the actual point and the expected point is less than 0.066. Moreover,
the training time of FANRE is about four orders of magnitude less than that of
SuperSAB. So we conclude that FANRE is superior to SuperSAB on this problem.





Algorithm   Training set size   Training set variance   Test set variance   Training time (second)

FANRE       100                 0.002295                0.002375            0.091
FANRE       200                 0.001883                0.002462            0.230
FANRE       400                 0.003202                0.003474            0.751
FANRE       600                 0.003875                0.004289            1.543
FANRE       800                 0.002733                0.003107            2.634
SuperSAB    100                 0.112657                0.142652            5,923
SuperSAB    200                 0.099063                0.125969
SuperSAB    400                 0.123704                0.153696
SuperSAB    600                 0.145957                0.165955
SuperSAB    800                 0.137124                0.151246

Table 3. Experimental results of 2-d Mexican Hat approximation



For the same reason given in Section 3.3, although the approximating effect of
SuperSAB might be improved, we still regard FANRE as superior because the
approximation of FANRE is already quite good and its training time is far less than
that of SuperSAB. Fig. 5 shows the approximating results using 800 training instances
and 1,000 test instances, that is, the curves are depicted with 1,000 points. It is
obvious that FANRE approximates the prototype curve quite closely while SuperSAB is of
almost no use.

Fig. 5. Comparison of the approximating results on y=sinc|x| for
$x \in [-4\pi, 4\pi]$, using 1,000 test points: a) prototype of 2-d Mexican Hat
function; b) approximating result of FANRE; c) approximating result of SuperSAB



4. Conclusions and Future Work 



This paper proposes a fast neural regression estimator named FANRE, which exploits
the advantages of both Adaptive Resonance Theory and Field Theory. It needs only
one-pass learning, and achieves not only impressive approximating effect but also fast
learning speed. In addition, FANRE has incremental learning ability, which makes it
fit for real-time online learning tasks. Moreover, FANRE can adaptively set up its
topology, so that the disadvantage of manually determining the number of hidden units,
which affects most feed-forward neural models, is overcome. Experimental results show
that FANRE is superior to SuperSAB in both approximating effect and training time cost.

So far, FANRE has only been compared against a BP-type algorithm. In order to
establish its superiority, more comparisons should be made against other neural
algorithms such as Fuzzy ARTMAP [9] and Cascade-Correlation [10]. We plan to
do this work in the near future. Moreover, FANRE has only been applied to artificial
problems such as function approximation with a single input and a single output.
Although some of them are quite difficult, they cannot prove that FANRE is effective
when facing real world tasks with multiple variables. In the future, we also plan to
develop neural regression estimation systems based upon FANRE that aim to
solve real world problems.


Abstract. Macroeconomic forecasting has traditionally been performed with
econometric tools, though these methods necessarily make many
theoretical assumptions that are not valid in all circumstances. The main
advantage of approaches that apply machine learning algorithms to
economic data is that forecasts largely free of assumptions can be made. This
study presents an approach to macroeconomic forecasting that generates fuzzy
rules from data using a fuzzy control system architecture and evolutionary
programming. However, the selection of a defuzzification method is typically
performed subjectively in fuzzy control systems. We demonstrate that the
selection of defuzzification method has a substantial impact on forecasts. In
order to overcome this subjectivity and further our objective of
developing forecasting systems free of any technical or theoretical assumptions,
we introduce a neural network to perform the defuzzification. The performance
of our approach compares very favourably with other data mining techniques
on cross validation tests with macroeconomic data.



1. Introduction 

Macroeconomic modelling and forecasting has traditionally been performed with
the exclusive use of mathematical and statistical tools. However, these tools are not
always appropriate for economic modelling because of the uncertainty associated with
human decision making. The development of any economy is determined by a wide
range of activities performed by humans as householders, managers, or government
policy makers. Persons in each role pursue different goals and, more importantly, base
their economic plans on decision-making in vague and often ambiguous terms. For
example, a householder may decide on the proportion of income to reserve
as savings according to the rule {IF my future salary is likely to diminish, THEN I
will save a greater proportion of my current salary}. Mathematical models of human
decision-making impose precise forms of continuous functions and overlook the
inherent fuzziness of the process.

In addition to imposing a crispness that may not be appropriate, mathematical and
statistical models necessarily make assumptions that derive from economic theories.
A large variety of sometimes conflicting models have emerged over the years as a
consequence of this. Inferences drawn from a model hold only to the extent that the
economic theoretical assumptions hold, yet this is often difficult to determine.
Macroeconomic researchers who rely solely on mathematical or statistical models are
compelled to make assumptions based on their own subjective view of the world or
theoretical background and beliefs. For example, hypotheses generated by researchers
who accept Keynesian assumptions are quite different from hypotheses from Classical
theorists. Hypotheses are not only dependent upon the subjective beliefs of their
creators but can easily become obsolete. Completely different economic systems can
arise at different times in different countries and be described by different models
[12].

We present a fuzzy system approach to macroeconomic modelling that better
represents the uncertainty caused by the prevalence of human factors in any economy.
An evolutionary approach to building a fuzzy forecasting system can facilitate the
design of a system that is largely free of subjective assumptions and based only on
patterns in the data.

In [17] we showed that fuzzy control systems can be successfully applied to
macroeconomic forecasting tasks. Moreover, in the examples investigated, the results
were found to compare favourably with those obtained using traditional statistical
regression models or conventional neural network approximation techniques. In that
work, the fuzzy sets and membership functions were pre-set and the defuzzification
method was predetermined. The main objective was to apply an evolutionary search in
order to generate fuzzy rules that best describe macroeconomic data.

The search for a set of fuzzy rules that best describes macroeconomic data varies
according to the defuzzification method employed in a fuzzy control
architecture. For example, we report differences in results between the centre of
gravity, area and maximal height methods. However, there is no theoretical basis for
preferring one method to another. Given that the choice of defuzzification method is
important for our work and for other applications of fuzzy control, we sought to
identify a strategy for defuzzifying concepts that was not based on a subjective
selection of one of the known methods. We summarise our motivation in the
following way:

• To suggest an objective way of replacing a defuzzification method in the fuzzy
forecasting system’s structure

• To improve performance of the fuzzy forecasting system on an example of
macroeconomic modelling

In this paper we show that a neural network can be used to replace a predefined
defuzzification procedure in fuzzy logic reasoning. An application of this method to
macroeconomic forecasting shows that the system performs favourably on evaluation
trials. The system works with fuzzy sets of any shape and is more consistent with the
fuzzy nature of macroeconomic modelling.

In the next section of this paper we briefly describe the method we used to generate
fuzzy rules and highlight how results vary according to the defuzzification method
selected. This is included in order to provide the background and motivation for the
work described in this paper. In the following section we describe the use of a neural
network to perform the defuzzification.

2. Generation of Fuzzy Rules Using Predefined Defuzzification
Methods

In [17] we used an evolutionary programming approach to generate fuzzy rules that
best fit a given data set. We believe that fuzzy logic, though not normally used in
macroeconomic modelling, is suitable for capturing the uncertainty inherent in the
problem domain. An evolutionary approach to building the system facilitates the
design of a system that is free of subjective assumptions and based only on patterns
in the data.

Although genetic algorithms have been used to generate fuzzy rules by
Rutkowska [13], Karr [6], and Yuan and Zhuang [18], these approaches, in
attempting to discover optimal rules and ideal membership functions, introduce
theoretical and technical definitions and assumptions that we wish to avoid. A key
assumption we make with macroeconomic data is that the membership functions that
map quantitative crisp values onto qualitative fuzzy values can be pre-set. With most
macroeconomic indicators there seems to be general agreement about the mapping of
quantitative terms onto qualitative terms, though future research is planned to
verify this empirically. In order to reduce the complexity of the search problem and in
light of the nature of economic data, we do not search for near optimal membership
functions but instead predetermine a membership function that seems reasonable.

Following positive results from artificially generated data where the functional
dependence between input variables was known, we tested the rule generation with
real economic data. We chose economic indicators with well-known
interrelationships. The Keynesian General Theory is based on a fundamental
assumption that the level of national income determines the level of consumption [7].
This hypothesis has been quite successfully tested in many developed countries. In
contrast, according to classical economic theory, interest rates impact on
the level of consumption. Classical theorists believe that if the level of interest
rates rises, then people expect to earn more money in the future on each dollar saved
in the present and will reduce present consumption.

We expected to generate fuzzy rules that capture the well-known associations described
by both Keynesian and Classical theories and, in addition, identify more accurately
the ways in which the theories conflict. Economic data describing the dynamics of these
indicators in the United States was obtained from The Federal Reserve Bank of St.
Louis. The records were collected on a regular basis from 1960 until 1997. Fuzzy rules
were generated and evaluated using cross validation by comparison with linear
regression and a neural network trained with the same data. Results summarised in
Table 1 indicate that the fuzzy rules generated demonstrate comparable predictive
performance compared with the neural network and superior performance when
compared with linear regression.

The fuzzy rules, represented in tabular format, are included in Table 2. Y is the
change from one quarter to the next in national income, represented by the fuzzy
values PH (positive high), PL (positive low), NL (negative low) and NH (negative
high). I is the change in interest rates. The fuzzy rules predict the change in
consumption. The black box nature of neural networks is a distinct disadvantage for
the analysis of macroeconomic data. In contrast, as Table 2 illustrates, fuzzy rules
generated without any theoretical assumptions can be used to explore patterns and even
to assess the veracity of theories. For example, we see from this table that
consumption rises (PH, PL) with a sharp drop in national income and a drop in
interest rates. Keynesian economics predicts a drop in consumption in this case
whereas Classical theory predicts a rise. The fuzzy rules indicate that a rise does
occur but only when the national income drops dramatically. When the change in
national income is still negative but small (Y=NL), consumption does not rise but
displays the behaviour predicted by Keynesian theory.







                 Fuzzy rules   Neural network   Linear regression

Mean             14.75         17.31            23.25
Std. Deviation   5.5           5.56             10.42
Median           13.59         15.4             21.98

Table 1. Comparison of fuzzy rules, neural network and linear regression

Y \ I   NH   NL   PL   PH

NH      PH   PL   NH   NL
NL      NH   NL   NL   NL
PL      NL   PL   PL   NL
PH      PH   PH   PH   NL

Table 2. Fuzzy rules generated for the macroeconomic data.
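The rule table can be read as a simple lookup from the fuzzy values of Y and I to the predicted change in consumption. The sketch below encodes Table 2 under the assumption that rows index Y and columns index I, which is the reading consistent with the discussion of Y=NH and Y=NL in the text; the names `RULES` and `predict_consumption` are illustrative.

```python
# RULES[Y][I] is the predicted fuzzy change in consumption.
# Row/column orientation is an assumption (rows: Y, columns: I).
RULES = {
    "NH": {"NH": "PH", "NL": "PL", "PL": "NH", "PH": "NL"},
    "NL": {"NH": "NH", "NL": "NL", "PL": "NL", "PH": "NL"},
    "PL": {"NH": "NL", "NL": "PL", "PL": "PL", "PH": "NL"},
    "PH": {"NH": "PH", "NL": "PH", "PL": "PH", "PH": "NL"},
}


def predict_consumption(income_change, interest_change):
    """Look up the fuzzy change in consumption for fuzzy values of Y and I."""
    return RULES[income_change][interest_change]


# Sharp drop in national income together with a drop in interest rates.
sharp_drop = [predict_consumption("NH", i) for i in ("NH", "NL")]
# Small drop in national income, for every value of I.
small_drop = [predict_consumption("NL", i) for i in ("NH", "NL", "PL", "PH")]
```

Under this reading, a sharp drop in income with falling interest rates yields a rise in consumption, while a small drop in income never does.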

Although results from fuzzy rule generation were promising, we were concerned
that the defuzzification method used in that study, centre of gravity, may not have
been the ideal method. We carried out a tenfold cross validation test over test data
using three different defuzzification methods: the Centre of gravity method, the Area
method and the Maximal height method. Table 3 summarises the evaluation results
obtained.







            Centre of gravity   Area method   Maximal height method

Mean        14.8                11.9          19.7
Median      13.6                9.2           16.8
Std. Dev.   5.8                 7.7           11.8

Table 3. Prediction results of the generated fuzzy forecasting system using different
defuzzification methods.
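A tenfold cross validation of the kind used for Table 3 can be organised as in the sketch below. It is a generic illustration: whether the original study used contiguous or shuffled folds is not stated, so the contiguous split here is an assumption, and the function name and the sample size of 25 are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    folds, start = [], 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(25, k=10)
```

Each record appears in exactly one test fold, so statistics such as the mean, median and standard deviation can be taken over the ten per-fold results.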

Table 3 indicates that the performance on test data over each cross validation set
varies substantially depending on the defuzzification method used. In order to remain
consistent with our objective of developing fuzzy rules that are independent of any
theoretical or methodological biases, we sought to derive a defuzzification procedure
that was based in some way only on the data in our training sets. In the next section
we describe the use of a neural network to achieve this end.

3. Neural Network Component of Fuzzy-Neural Hybrid Systems
as an Objective Defuzzification Procedure

The integration of fuzzy reasoning with neural networks has been fruitful [15], [3],
[1], though the integration has been performed in many diverse ways. Many hybrid
systems use a neural network to encode a fuzzy system. Each step in the fuzzy system
process is equivalent to at least one layer of the neural network [5], [16], [11], [4],
[10], [14]. Most of the architectures consistent with this approach have at least four
layers corresponding to fuzzification, intersection, rule application and
defuzzification respectively. Furthermore, these architectures differ from conventional
neural networks in terms of the uniformity of both the processing nodes and the
interconnection strategy.

The main common disadvantage of these systems is the high dimensionality of the
networks, which puts obvious limitations on an implementation. Maguire, McGinnity
and McDaid introduced in [8] a hybrid system that overcomes these limitations.
Although the number of layers in their system is only three, the number of nodes in the
hidden layer must equal the sum of the numbers of fuzzy sets distributed over
all input variables. Moreover, this approach is restrictive in that only the bell-type
membership function can be used for a chosen neural network activation function.
Generating the output according to the Sugeno type fuzzy model finds a weighted
average of the rules’ outputs and is not related to the fuzzy representation of the
output variable.

In our approach, we do not model the fuzzy system processes using layers of a 
neural network but instead take the outputs of the fuzzy system as inputs to a neural 
network. The neural network outputs are used to determine a crisp value, so the 
result does not depend on the vagaries of a subjectively chosen defuzzification 
method. The defuzzification method presented here can be used with fuzzy rules 
elicited from experts or with those generated from data using the evolutionary 
program outlined above. 

There are two kinds of input to the neural network: 1) the label that represents a 
fuzzy set and 2) the height of that set, i.e. a membership function value. These are 
the parameters that underpin the centre of gravity, area and maximal height 
defuzzification methods. A brief survey of these defuzzification methods will 
highlight the importance of the label and height as inputs to a neural network. 

The majority of defuzzification methods label each output fuzzy set inferred from a 
fuzzy rules table. The height of the set is important when performing a union of two 
or more sets. Defuzzification methods differ in how labels and heights are combined 
to infer a crisp value. For example, in the Area method the crisp value is obtained 
according to the formula: 

$$z_0 = \frac{\sum_i S_i z_i}{\sum_i S_i} \qquad (1)$$

In this formula $z_0$ is the output crisp value; $S_i$ is the area of the $i$-th output 
fuzzy set, which is inferred from the fuzzy rule table; and $z_i$ is a representative 
point of that set, which is obtained using the height of the fuzzy set. 
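
The Area method described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the method is the area-weighted average of the representative points, $z_0 = \sum_i S_i z_i / \sum_i S_i$; the function name and the sample numbers are hypothetical and are not taken from the experiments reported here.

```python
def area_defuzzify(areas, points):
    """Area-weighted average of representative points: sum(S_i * z_i) / sum(S_i).

    areas  -- S_i, the area of each inferred output fuzzy set
    points -- z_i, the representative point of each set
    """
    if len(areas) != len(points):
        raise ValueError("areas and points must have the same length")
    total = sum(areas)
    if total == 0:
        raise ValueError("no output fuzzy set was inferred (total area is zero)")
    return sum(s * z for s, z in zip(areas, points)) / total


# Hypothetical example: two inferred output sets with areas 1.0 and 3.0
# whose representative points are 2.0 and 6.0.
print(area_defuzzify([1.0, 3.0], [2.0, 6.0]))  # 5.0
```

The larger set pulls the crisp value towards its own representative point, which is the sense in which the choice of method (area, height, or centre of gravity) fixes in advance how labels and heights are combined.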

In a related vein, the Maximal height method uses a weighted average of the heights 
and labels of the points inferred from the fuzzy rules, where the weights are the 
heights of the fuzzy sets in the union. The Median and centre of gravity methods use 
the properties of the united output fuzzy set, which, in turn, is constructed from the 
heights and labels of the fuzzy sets being united. According to these methods we 
find, respectively, the point which divides the united output fuzzy set into two equal 
areas and its centre of gravity. 

In contrast to all defuzzification methods, instead of using the labels and heights of 
inferred fuzzy rules in any rigid subjective way, we take them as inputs of the neural 
network. Fig. 1 illustrates the structure of the fuzzy-neural hybrid system. 




Fig. 1. Fuzzy-neural hybrid systems structure. 

According to the procedure reflected in Fig. 1, the first stage of the fuzzy-neural 
forecasting system involves the fuzzy control component. 

Having determined the structure of the fuzzy control component of the hybrid 
system, we apply fuzzy reasoning to each record of the data set. In our case we use 
Mamdani’s “min-max-gravity” fuzzy reasoning method [9] with the “AND” 
operator, with one exception: we stop the fuzzy control procedure before the 
unification stage and the defuzzification stage that follows it. Thus, for example, in 
the two-input-one-output case each fact {$x_0, y_0 \to z_0$; $x_0 \in A_i$, $y_0 \in B_i$, 
$z_0 \in C_i$ with degrees $\mu_{A_i}(x_0)$, $\mu_{B_i}(y_0)$, $\mu_{C_i}(z_0)$ respectively, where 
$A_i$, $B_i$, $C_i$ are fuzzy sets and $\mu$ is a membership function} will produce the 
label of the output fuzzy set $C'_i$ and a height for this set, which is calculated 
according to the formula below: 

$$\mu_{C'_i} = \min\{\mu_{A_i}(x_0), \mu_{B_i}(y_0)\} \qquad (2)$$

For each rule $i$, pairs of these labels and heights {$C'_i$ and $\mu_{C'_i}$} are 
obtained and form the inputs for the feed-forward neural network. In order to reduce 
the number of input nodes for the neural network we take only those pairs that have 
non-null values of the output fuzzy set’s height. Because the number of these pairs 
can vary, while the number of input nodes must be fixed, we take the number of input 
nodes to be equal to the maximal possible number of non-null values for all input 
variables plus the same number of corresponding fuzzy set labels. The maximal 
number of non-null values of a membership function equals the product of the 
maximal numbers of overlapping fuzzy sets for each variable. 
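
The construction of the network inputs can be illustrated with a short Python sketch. It assumes triangular membership functions, a made-up rule table, a numeric encoding of the output labels, and zero padding of unused input nodes; none of these details is specified in the text, so all names and values below are hypothetical. The height of each inferred output set is taken as the minimum of the two input membership degrees, following the "AND" operator of Mamdani's method.

```python
def triangle(a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


# Hypothetical fuzzy sets: at most two sets overlap at any point of x or y.
X_SETS = {"A1": triangle(-1, 0, 1), "B1": triangle(0, 1, 2), "C1": triangle(1, 2, 3)}
Y_SETS = {"A2": triangle(-1, 0, 1), "B2": triangle(0, 1, 2)}

# Hypothetical rule table: (x set, y set) -> output set label.
RULES = {
    ("A1", "A2"): "A3", ("A1", "B2"): "A3",
    ("B1", "A2"): "A3", ("B1", "B2"): "B3",
    ("C1", "A2"): "B3", ("C1", "B2"): "B3",
}

# Assumed numeric encoding of labels so that they can be fed to a network.
LABEL_CODE = {"A3": 1.0, "B3": 2.0}


def label_height_pairs(x, y):
    """Return (label, height) for every rule that fires with non-null height."""
    pairs = []
    for (sx, sy), out in RULES.items():
        height = min(X_SETS[sx](x), Y_SETS[sy](y))  # AND operator
        if height > 0.0:
            pairs.append((out, height))
    return pairs


def network_input(x, y, max_pairs=4):
    """Fixed-length input vector: heights and label codes, padded with zeros.

    max_pairs = 2 * 2 is the product of the maximal numbers of overlapping
    sets for x and y, so the vector has 2 * max_pairs = 8 entries.
    """
    vec = []
    for out, height in label_height_pairs(x, y):
        vec += [height, LABEL_CODE[out]]
    return vec + [0.0] * (2 * max_pairs - len(vec))


print(label_height_pairs(0.25, 0.5))
print(network_input(0.25, 0.5))
```

With these sample sets, four rules fire for the fact x = 0.25, y = 0.5, so all eight input nodes are used; a fact that fires fewer rules leaves the remaining nodes at zero.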

The output of the neural network can conceivably be either a crisp value or a non- 
null value of the membership function and the corresponding label of each set it 
belongs to. We have implemented the second option because it is more consistent 
with the fuzzy nature of an output variable. With this option, if we change the shapes 
of the output fuzzy sets, the learning procedure and the system’s properties will vary. 
If the network output a crisp value directly, such a change would have no effect, 
because the crisp value stays the same. 

A crisp value is easily calculated from a membership function value and the 
corresponding fuzzy set label, as follows. We calculate the crisp value of the output 
for each pair of non-null membership function value and corresponding fuzzy set 
label output by the neural network. Then we take the average of these crisp values, 
as shown in Fig. 2. In Fig. 2, {n1, A, n2, B} are sample values for the output nodes of 
the neural network, given fuzzy sets A and B. The crisp value x0 is obtained as the 
average of the two crisp values x1 and x2, one for each pair of non-null membership 
function value and fuzzy set label. Experiments show that within a well-trained 
neural network the difference between the crisp values of the output for each such 
pair converges to zero. 




fuzzy-neural system 



In the next section we illustrate the use of a neural network for defuzzification with 
a conceptual example. Following that we apply the technique to macroeconomic data. 



4. A Conceptual Example 

Figure 3 illustrates two sample fuzzified input variables, x and y, and one output 
variable, z. A1, B1, C1 are fuzzy sets for variable x; A2, B2 are fuzzy sets for 
variable y; and A3, B3 are fuzzy sets for the output variable z. Figure 4 illustrates a 
sample fuzzy rules table of the hybrid system. 

Let us consider two facts, illustrated in Figure 3, for the training of the fuzzy- 
neural hybrid system: {x1, y1 -> z1} and {x2, y2 -> z2}. Figure 5 illustrates the 
values of the membership functions for each fuzzy variable from these two facts. 








Fig. 5. Membership function values for the facts {x1, y1 -> z1} and {x2, y2 -> z2} 



Fig. 6 illustrates the inferred fuzzy sets and their heights, obtained by applying the 
standard Mamdani fuzzy reasoning procedure without the unification and subsequent 
defuzzification stages. $C'_i$ represents a fuzzy label and $\mu_{C'_i}$ represents the 
corresponding fuzzy set’s height, which we obtain according to Mamdani’s 
procedure for the two given facts {x1, y1 -> z1} and {x2, y2 -> z2}. 



According to Figure 3 the maximal number of overlapping fuzzy sets is 2 for x 
(A1 and B1) and 2 for y (A2 and B2). Therefore the number of input nodes for 
the neural network in this example is 2×2×2 = 8, that is, 2×2 = 4 heights plus the 4 
corresponding labels. Training data for the neural network are illustrated in Figure 7. 

Fig. 6. Heights and the corresponding fuzzy set labels inferred according to Mamdani’s fuzzy 
reasoning procedure 

Fig. 7. Sample training data for the example neural network 



The next section describes an implementation of the proposed system for 
macroeconomic forecasting using US economic data. 



5. Implementation of the Fuzzy-Neural Hybrid System with 
Macro-economic Data 

The use of a neural network to perform the defuzzification was evaluated using the 
macroeconomic data described in Section 2. Fuzzy rules that predict national 
consumption, C, from interest rates, I, and national income, Y, were generated using 
the evolutionary program cited in that section. The fuzzy rules obtained are illustrated 
in Table 2. The fuzzy sets negative high (NH), negative low (NL), positive high (PH) 
and positive low (PL) refer to changes in interest rates over a quarter. Data 
representing consumption, interest rates and national income between 1960 and 1997 
were collected by the Federal Reserve Bank of St Louis. The data were transformed 
by converting quarterly records to changes in those values from one quarter to the 
next; 150 records representing quarterly changes were collated. The intervals of real 
values for the inputs and outputs were set from the minimum and maximum observed 
changes in each variable. 

Tenfold cross-validation was used with hold-out sets of size 15 and training sets of 
size 135. For each cross-validation set, fuzzy rules were generated as described in 
Section 2, training sets for the defuzzification neural network were created and a 
neural network was trained. The neural network had a feed-forward architecture 
consisting of three layers. The input layer had 8 nodes and the output layer had four 
nodes. The inputs were membership function values and labels as described in the 
conceptual example above. One hidden layer with 8 nodes was used and the network 
was trained using back-propagation of errors with a learning rate of 0.2. A crisp value 
for consumption was calculated from the outputs of the neural network, that is, the 
membership function values and fuzzy set labels. Each network was trained until the 
error rate on unseen test cases began to rise. Once trained, the network was applied to 
each of the 135 training set records to defuzzify the outputs. The sum of squared 
differences between the consumption predicted by the system and the actual 
consumption on the test set of 15 was recorded for each cross-validation set. 
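
The evaluation protocol can be sketched as follows. This is a minimal Python illustration, not the code used in the study: it assumes contiguous hold-out blocks (the text does not say how the folds were formed), uses synthetic records, and substitutes a trivial mean predictor (`fit_mean`, a hypothetical placeholder) for the fuzzy-neural system.

```python
def tenfold_splits(n_records=150, n_folds=10):
    """Yield (train_indices, test_indices); each hold-out block has n/k records."""
    fold = n_records // n_folds
    for k in range(n_folds):
        lo, hi = k * fold, (k + 1) * fold
        test = list(range(lo, hi))
        train = [i for i in range(n_records) if i < lo or i >= hi]
        yield train, test


def sum_squared_error(predicted, actual):
    """Sum of squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def cross_validate(records, fit, n_folds=10):
    """records: list of (inputs, target); fit(train) returns a predict function."""
    errors = []
    for train_idx, test_idx in tenfold_splits(len(records), n_folds):
        predict = fit([records[i] for i in train_idx])
        predicted = [predict(records[i][0]) for i in test_idx]
        actual = [records[i][1] for i in test_idx]
        errors.append(sum_squared_error(predicted, actual))
    return errors


def fit_mean(train):
    """Placeholder model: always predict the mean target of the training set."""
    mean = sum(target for _, target in train) / len(train)
    return lambda inputs: mean


# Synthetic stand-in for the 150 quarterly-change records.
records = [((float(i), float(i % 7)), float(i % 3)) for i in range(150)]
errors = cross_validate(records, fit_mean)
print(len(errors), sorted(errors)[:3])
```

Sorting the ten per-fold errors, as in the last line, gives the kind of ranked column that is reported for each technique in the comparison.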

The performance of our fuzzy rules generator and neural defuzzification approach 
was evaluated against other data mining techniques: a single feed-forward 
neural network trained with back-propagation of errors (3 layers, learning rate = 0.2, 
no improvement in error rates after 40-55 epochs), log-linear regression, linear 
regression, and the same fuzzy rules defuzzified using the Area method. Fig. 8 
shows the ranked sums of squared differences between predicted and actual change 
in consumption for each technique over the ten cross-validation sets. 





Rank        Neural     Log-linear   Linear       Fuzzy system     Fuzzy-neural 
            network    regression   regression   (area defuzz.    hybrid system 
                                                 method) 
1           14.0       2.6          11.4         3.1              3.7 
2           14.4       8.1          11.6         5.8              4.8 
3           14.7       8.8          13.7         7.0              7.3 
4           15.1       9.2          15.1         7.8              7.5 
5           15.4       9.4          21.5         8.1              8.9 
6           16.1       10.2         22.4         10.3             9.5 
7           16.6       11.6         26.0         11.0             11.3 
8           17.2       15.0         29.8         18.3             13.1 
9           18.4       19.4         40.0         24.3             14.6 
10          19.1       32.0         41.1         25.1             14.9 
Mean        16.10      12.62        23.27        11.9             9.5 
Median      15.74      9.78         21.98        9.2              9.2 
Std. Dev.   1.71       8.11         10.99        7.76             3.9 

Fig. 8. Comparison of the evaluation results of the neural network, linear regression, log-linear 
regression, fuzzy system and fuzzy-neural hybrid system. 

Figure 8 illustrates that the mean error over cross-validation sets was lower for the 
fuzzy rules defuzzified using the trained neural network presented here than for the 
other techniques. As described in Section 2, defuzzification using the area 
method produced lower errors than defuzzification using the centre of gravity or the 
maximal height method. The neural defuzzification presented here displayed the same 
median as the area method but a lower mean error and lower variation around the 
mean across cross-validation sets. This suggests that the objective of this study, to 
identify a defuzzification method that is not based on a subjective choice but is 
tuned to empirical data, was realised. 

The generated fuzzy rules defuzzified with the neural network performed well 
compared with the other data mining techniques (mean error 9.5, against 11.9 for the 
Area method and 12.62 for log-linear regression). The poor performance of the linear 
regression model is consistent with the macroeconomic theory perspective that 
consumption is not best described as a simple linear relationship with interest rates or 
national income. Although the mean error for the single neural network was quite 
high, the variation over the cross-validation sets was quite low. This may suggest that 
the neural network results could be improved with more attention to adjusting 
learning rates, momentum and bias terms. However, it is also possible that 
contradictory records (i.e. the same national income and interest rate inputs but 
different consumption outputs) in the data prohibit better performance no matter how 
much the parameters are adjusted, which would suggest that the fuzzy system 
proposed here can deal more adequately with contradictory data. Log-linear 
regression performed quite well and, if not for cross-validation set 10, would have 
indicated very good performance. This suggests that many data points can be 
described by a log-linear function but that some values will be very poorly described 
using this predefined function. The advantage of using the fuzzy rules generator and 
neural defuzzification is that no assumptions about the functional form of the data are 
used at all. To a substantial degree, the data speak for themselves. 



6. Further Research and Conclusions 

We have developed an approach that applies fuzzy reasoning to macroeconomic 
forecasting, a field dominated by mathematical programming and statistical methods 
and riddled with a reliance on theoretical and technical assumptions that give rise to 
substantial variance in forecasts. As far as possible our approach is intended to be 
free of assumptions so that forecasts depend only on the economic data input to the 
system. To realise this end we do not use fuzzy rules identified by experts but 
generate rules using evolutionary programs. The forecasts obtained were quite good 
but varied depending on the defuzzification method selected. In our trials the area 
method outperformed the centre of gravity and maximal height methods, but this is 
unlikely to be the case with all data sets. Instead of selecting one defuzzification 
method over another subjectively, or engaging in expensive empirical trials of each 
one with every new data set, we adapted a neural network for defuzzification so as to 
avoid the use of standard methods entirely. The results suggest that our approach can 
be used for data mining and that it compares favourably with other techniques. 

Future work will proceed in three directions. First, we aim to trial the method using 
incomplete and noisy data in order to simulate the economies of many nations where 
economic data are not collected reliably. Secondly, we aim to trial the method on data 
drawn from a variety of non-economic domains. Thirdly, we aim to trial the method 
in an application of game theory to model the behaviour of voters. 

Freixas and Gambarelli [2] identify vastly different outcomes in voter predictions 
based on assumptions made in mapping the proportion of the votes an individual has 
to a measure of the degree of power the individual has in shaping the outcome of a 
decision. We expect that the subjectivity present in this mapping can be avoided by 
using the fuzzy-neural hybrid techniques presented here. 



Abstract. In order to give users integrated access to a large number 
of heterogeneous, autonomous information sources, we need an effective 
and efficient mechanism for enabling knowledge to be shared and 
exchanged. Sharing of knowledge, to be efficient and effective, must take 
account of semantic representation and semantic integration. In this 
paper, we provide a mechanism for knowledge sharing. We have made use 
of WordNet as linguistic knowledge to represent and interpret the 
meaning of the information, to integrate the information, and to give users 
efficient access to the integrated information. 



1 Introduction 

An information integration system provides a uniform interface to various 
information sources. Consider a user who wants to find the professors whose salary 
is over $50,000. This question might be answered using a specific university 
database. However, if the user issues the query with the intent of retrieving all the 
professors whose salary is over $50,000 from all the accessible university databases, 
the question must be answered using integrated information rather than a 
single source. In order to give users integrated access to such environments, we 
need an effective and efficient mechanism for enabling knowledge to be shared 
and exchanged. Exchange of knowledge, to be effective, must take place in an 
environment where it can be ensured that an information source interprets the 
information in exactly the same way as intended by the other sources. The 
information must also be easy to locate and retrieve. This is only possible where 
the meaning and method of representation of the information are known and 
agreed upon by the information sources. 

In this paper, we introduce a mechanism for knowledge sharing among multiple 
information sources. For information sources, we focus on database systems, 
particularly relational database systems. A multidatabase system provides 
integrated access to heterogeneous, autonomous component databases in a 
distributed system. An essential prerequisite to achieving interoperability in 
multidatabase systems is to be able to identify semantically equivalent or related 
data items in component databases. 

While there is a significant amount of research discussing schema differences, 
work on semantic issues in multidatabase systems is insufficient. Because 
schema considerations alone do not suffice to detect semantic heterogeneity [1], we 
have made use of linguistic knowledge from WordNet. WordNet is an on-line lexical 
dictionary organized by semantic relations such as synonymy, antonymy, 
hyponymy, and meronymy [6]. The noun portion of WordNet is designed around 
the concept of the synset, which is a set of closely related synonyms representing a 
word meaning. 

The rest of this paper is organized as follows. In section 2, we address some 
preliminaries to our approach. In section 3, we explain the information 
integration process. An efficient access mechanism to an integrated environment 
is presented in section 4. After reviewing related work in section 5, we 
offer our conclusion in section 6. 



2 Preliminaries 

In this section, we briefly present several considerations for knowledge sharing 
and address the problem of semantic heterogeneity that must be detected and 
resolved for information integration. An overview of our approach for knowledge 
sharing will be outlined. 

2.1 Considerations for Knowledge Sharing 

In open and dynamic environments such as the Web, numerous information 
sources exist and new information sources can be created autonomously and 
continuously without formal control. In order to give a multidatabase system 
adaptability in those environments, we need a mechanism for enabling knowledge 
to be shared and exchanged. Sharing of knowledge, to be efficient and effective, 
must take account of several considerations: 

1. The meaning of the information in each component database must be 
represented in a unified way (semantic representation). 

2. A multidatabase system must interpret the meaning of the information in 
each component database (semantic interpretation). 

3. A multidatabase system must integrate the information in all the component 
databases (information integration). 

4. An efficient and effective access mechanism must be provided to retrieve 
desired information from the integrated information (information access). 

2.2 Classification of Semantic Heterogeneity 

Semantic heterogeneities include differences in the way the real world is modeled 
in the databases, particularly in the schemas of the databases [7]. Figure 1 shows 
an example to illustrate semantic heterogeneities. 

Component Database 1 (CDB1) 
    Undergraduate (sid, name, sex, address, advisor#) 
    Graduate (sid, name, sex, address, advisor#) 
    FullProfessor (pid, name, sex, office) 
    AssociateProfessor (pid, name, sex, office) 
    AssistantProfessor (pid, name, sex, office) 

Component Database 2 (CDB2) 
    Student (sid, nm, sex, advisor#) 
    Address (sid, street, city, state) 
    Professor (pid, nm, sex, salary, office) 

Component Database 3 (CDB3) 
    FemaleStudent (sid, name, street, city, state, advisor#) 
    MaleStudent (sid, name, street, city, state, advisor#) 
    FemaleProfessor (pid, name, salary, office) 
    MaleProfessor (pid, name, salary, office) 

Component Database 4 (CDB4) 
    Pupil (pid, nm, female, male, advisor#) 
    Teacher (tid, nm, office) 

Fig. 1. Example database schemas 

Since a database is defined by its schema and data, semantic heterogeneities 
can be classified into schema conflicts and data conflicts [9]. Schema conflicts 
mainly result from the use of different structures for the same information and 
the use of different names for the same structures. For example, in figure 1, 
CDB4 uses two attributes, female and male, for information on sex, while the 
same information is represented as values of the attribute sex in CDB1. Data 
conflicts are due to inconsistent data in the absence of schema conflicts. 

As our focus is only on schema conflicts, we assume that data conflicts, 
such as different representations of the same data, have already been reconciled. 
The types of schema conflict considered in this paper are defined as follows. 

- Entity versus Entity Structure Conflicts (EESC) 

These conflicts occur when component databases use different numbers of 
entities to represent the same information. 

- Entity versus Attribute Structure Conflicts (EASC) 

These conflicts occur when an attribute in some component databases is 
represented as an entity in others. 

- Entity versus Value Structure Conflicts (EVSC) 

These conflicts occur when the attribute values in some component databases 
are semantically related to the entities in other component databases. 

- Attribute versus Attribute Structure Conflicts (AASC) 

These conflicts occur when component databases use different numbers of 
attributes to represent the same information. 

- Attribute versus Value Structure Conflicts (AVSC) 

These conflicts occur when the attribute values in some component databases 
are semantically related to the attributes in others. 

- Entity versus Entity Name Conflicts (EENC) 

These conflicts arise due to different names assigned to the entities in differ- 
ent component databases. 

- Attribute versus Attribute Name Conflicts (AANC) 

These conflicts arise, as with entity name conflicts, when different names are 
assigned to the attributes in different component databases. 

2.3 An Overview of Our Approach for Knowledge Sharing 

In our approach, we have made use of WordNet as linguistic knowledge to 
represent and interpret the meaning of the information, to integrate information, and 
to give users an efficient access mechanism to the integrated system. The basic idea 
is to make a semantic network for each component database and to use WordNet 
to provide mappings between the semantic networks. Figure 2 shows an outline 
of our approach. 

Fig. 2. An outline of our approach for knowledge sharing 



Using WordNet and the descriptions of the database objects (entities and at- 
tributes), we construct a semantic network for each component database. Then, 
a global semantic network can be created with the semantic relations in Word- 
Net and the semantic networks. A global semantic network provides semantic 
knowledge about a distributed environment. Also, we provide a semantic query 
language, SemQL, to capture the concepts about what users want, which en- 
ables users to issue queries to a large number of autonomous databases without 
knowledge of their schemas. 

3 Information Integration Process 

As mentioned in section 2, to integrate information from multiple component 
databases, users must represent the meaning of the information in each 
component database in a unified way. At each component database (CDB), to represent 
the information, a local database administrator (DBA) makes descriptions of the 
database objects. In making descriptions, we make reference to ISO/IEC 11179 
[10]. Using the descriptions and WordNet, users create a representation table. 
Then, a semantic network for the component database can be created according 
to the representation table. All the semantic networks in the component 
databases are then integrated into a global semantic network for a multidatabase 
system. 

3.1 Semantic Representation 

The following principles must be used in making descriptions that 
represent the meaning of the information in a component database. A description 
is formed according to syntactic and semantic rules. Semantic rules govern the 
source and content of the words used in a description and enable meaning to 
be conveyed. Semantics concerns the meanings of description components. The 
components are entity terms, property terms, key terms, and qualifier terms. An 
entity term is a component of a description which represents an activity or object 
in the real world. For example, in the description Student Last Name, the component 
Student is the entity term. A set of property terms must consist of terms which are 
discrete (the definition of each does not overlap the definition of any other) and 
complete (taken together, the set represents all information concepts required for 
the specification of database objects). For example, in the description Student 
Last Name, the component Last Name is the property term. A key term is a 
component of a description for a database object which describes the form of 
representation of the database object. For example, in Student Last Name, the 
component Name is the key term. Qualifier terms may be attached to entity terms, 
property terms, and key terms if necessary to uniquely identify a description. 

Syntactic principles specify the arrangement of components within a 
description. The entity term shall occupy the first (leftmost) position in the description. 
Qualifier terms shall precede the component qualified. The property term shall 
occupy the next position, and the key term shall occupy the last position. No 
abbreviations are allowed. For example, the description Student ID is not allowed; 
it must be Student Identification Number. Furthermore, all descriptions shall be 
unique within a component database. 

After descriptions are created according to the above rules, the descriptions 
are decomposed into unit terms. A unit term is a word or a phrase that can 
be found in WordNet. For example, since the compound noun ‘phone number’ can be 
found in WordNet, it is treated as a unit term. The result of this decomposition 
process is a representation table. A representation table consists of object type, 
object name, data type, description, and a set of unit terms. 
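
The decomposition of a description into unit terms can be sketched as a greedy longest-match lookup. This is only one plausible strategy, chosen here for illustration; the small `LEXICON` set is a hypothetical stand-in for a WordNet lookup, and the field values in the sample row (including the data type) are invented.

```python
# Hypothetical stand-in for WordNet: words and phrases assumed to have an entry.
LEXICON = {
    "student", "last name", "last", "name",
    "phone number", "phone", "number",
    "identification number", "identification",
}


def unit_terms(description, lexicon=LEXICON):
    """Split a description into unit terms, preferring the longest known phrase."""
    words = description.lower().split()
    terms, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in lexicon:
                terms.append(phrase)
                i = j
                break
        else:
            # Word not in the lexicon: keep it as its own term.
            terms.append(words[i])
            i += 1
    return terms


def representation_row(object_type, object_name, data_type, description):
    """One row of a representation table, with its set of unit terms."""
    return {
        "object_type": object_type,
        "object_name": object_name,
        "data_type": data_type,
        "description": description,
        "unit_terms": unit_terms(description),
    }


print(unit_terms("Student Phone Number"))  # ['student', 'phone number']
print(representation_row("attribute", "nm", "string", "Student Last Name"))
```

Because the longest phrase is tried first, ‘phone number’ is kept as a single unit term rather than being split into ‘phone’ and ‘number’. Choosing among several senses of a term is not handled here; as noted in the text, polysemy has to be resolved manually.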

In making a representation table, the CDB administrator must cope with 
synonymy and polysemy. To identify unit terms related by synonymy 
automatically, we use synsets in WordNet. However, to acquire the correct meaning of a 
unit term, the CDB administrator must deal with its polysemy manually. For 
example, when the local DBA inputs a unit term ‘client’ into WordNet, he/she must 
choose one among many different meanings. 

Given a set of unit terms of a component database, each unit term is 
connected with a word (or a phrase) in WordNet. The output of this process is 
a semantic network. A semantic network provides mappings between words in 
WordNet and unit terms in a component database. Figure 3 shows a semantic 
network for CDB2 of figure 1. 

Fig. 3. A semantic network for CDB2 



3.2 Semantic Interpretation and Information Integration 

A multidatabase system must interpret the meaning of the information and 
identify semantically equivalent or related objects. Once semantic networks are 
constructed, they are integrated into a global semantic network. In integrat- 
ing semantic networks, a multidatabase system can detect and resolve semantic 
heterogeneity based on them. 

The following examples show how we can detect semantic heterogeneity 
based on the semantic networks and the semantic relations in WordNet. 
The examples use the schemas of section 2.2. The results of the detection process 
can be used to resolve semantic heterogeneity in the information access phase. 



Fig. 4. Detection of EESC using hyponymy semantic relation 

Figure 4 shows a partial state of merging two component databases. According 
to the hyponymy semantic relation in WordNet, we can find that professor is a 
hypernym of full professor, associate professor, and assistant professor. 
Therefore, the entity Professor in CDB2 is semantically equivalent to the set of 
entities {FullProfessor, AssociateProfessor, AssistantProfessor} in CDB1. 



Fig. 5. Detection of EENC using synsets 

Consider another example that shows the use of synsets in WordNet. 
Figure 5 depicts the detection of EENC using synsets. As student and pupil are 
synonymous in WordNet, we can infer that the two entities, Student in CDB2 
and Pupil in CDB4, have the same meaning. 
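
The two detection examples can be sketched in Python as follows. To keep the sketch self-contained, the WordNet relations are replaced by two small hand-written tables (`HYPERNYM` and `SYNSETS`), and each entity is mapped directly to one unit term; these tables and function names are assumptions made for illustration, not part of the system described here.

```python
# Hand-written stand-ins for WordNet relations (illustrative only).
HYPERNYM = {
    "full professor": "professor",
    "associate professor": "professor",
    "assistant professor": "professor",
}
SYNSETS = [{"student", "pupil"}]

# Entity name -> unit term, for entities taken from the schemas of figure 1.
CDB1 = {
    "FullProfessor": "full professor",
    "AssociateProfessor": "associate professor",
    "AssistantProfessor": "assistant professor",
}
CDB2 = {"Student": "student", "Professor": "professor"}
CDB4 = {"Pupil": "pupil", "Teacher": "teacher"}


def same_synset(a, b):
    """True if the two terms are identical or share a synset."""
    return a == b or any(a in s and b in s for s in SYNSETS)


def detect_eesc(many, one):
    """Entities in `one` whose term is a hypernym of several entities in `many`."""
    found = []
    for name, term in one.items():
        hyponyms = sorted(n for n, t in many.items() if HYPERNYM.get(t) == term)
        if len(hyponyms) > 1:
            found.append((name, hyponyms))
    return found


def detect_eenc(a, b):
    """Pairs of differently named entities whose terms are synonymous."""
    return [
        (na, nb)
        for na, ta in a.items()
        for nb, tb in b.items()
        if na != nb and same_synset(ta, tb)
    ]


print(detect_eesc(CDB1, CDB2))
print(detect_eenc(CDB2, CDB4))
```

With these tables, the first call reports Professor in CDB2 against the three professor entities of CDB1, and the second reports the pair Student and Pupil, mirroring figures 4 and 5.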

The output of the merging phase is a global semantic network. A global 
semantic network provides a multidatabase system with the knowledge necessary for 
integrated access to component databases. The types of knowledge are as follows. 

- A multidatabase system must know where to find the relevant information 
in the component databases (access knowledge). 

- A multidatabase system must know which entities, attributes, or values in 
the component databases meet the semantics of the query (semantic 
knowledge). 

4 An Efficient Information Access to a Multidatabase 
System 

Users needing to combine information from several databases are faced with the 
problem of locating and integrating relevant information. An efficient and 
effective approach is to allow users to issue queries to a large number of autonomous 
databases in their own terms, which frees them from learning the schemas. We 
propose SemQL as a semantic query language with which users issue queries using 
not schema information but concepts that they know. 

4.1 SemQL: A Semantic Query Language 

SemQL is similar to SQL except that it has no FROM clause. The basic 
form of a SemQL query consists of the two clauses SELECT and WHERE and has 
the following form: 

SELECT <concept list> 

WHERE <condition> 

Here <concept list> is a list of concepts whose values are to be retrieved by 
the query. The <condition> is a conditional search expression that identifies the 
tuples to be retrieved by the query. 

The SemQL clauses specify not the entity or attribute names in the component 
database schemas but the names of the concepts that users want. For example, 
suppose a user wants to find the professors whose salary is over $50,000. We 
assume that the user is familiar with SQL, but does not know any of the component 
database schemas. Then the user might issue a query in SemQL using the 
concepts that he/she knows: 

SELECT professor.name 

WHERE professor.salary > $50,000 



4.2 Semantic Query Processing Procedure 



The overall procedure of semantic query processing is shown in figure 6. The 
SemQL Processor consists of Query Parser, Resource Finder, Mapping Genera- 
tor, Sub-query Generator, Query Distributor and Integrator. 




Fig. 6. The overall procedure of semantic query processing

Step 1: The user issues a semantic query in his or her own concepts to
retrieve equivalent or related data items.

Step 2: The Query Parser parses the query and extracts entity, attribute
and value concepts from the query.

Step 3: The Resource Finder identifies the relevant component databases
where the concepts exist using a global semantic network.

Step 4: The Mapping Generator generates the mappings between the concepts
in the original query and their representations in the component databases.

Step 5: The Sub-query Generator re-formulates the original query into
multiple sub-queries, one for each component database schema, according to
the mappings. In this step, the Sub-query Generator looks up the global
semantic network and adds a FROM clause to each sub-query.

Step 6: The Query Distributor submits the sub-queries to the component
databases.

Step 7: The component databases receive the sub-queries, execute them, and
return the result tuples to the SemQL Processor.

Step 8: The Integrator merges the intermediate results from the various
component databases and presents the integrated results to the users.

In this subsection, we introduce an example query scenario to demonstrate
the semantic query processing procedure of our approach. Through the example
scenario, we also explain how semantic conflicts can be resolved using
the global semantic network. The example query is to find those female students
who live in Seoul. We assume that the user who issues the query knows only the
concepts describing what he or she wants. That is, the user does not know the
detailed schema structure of each component database.

QUERY : Find those female students who live in Seoul.

The query can be posed as follows (Step 1):

SELECT student.name
WHERE student.sex = 'female' AND student.city = 'Seoul'

The Query Parser parses the query and extracts the concepts
{student, name, sex, female, city} from it (Step 2). The Resource Finder then
identifies the relevant component databases, CDB1, CDB2 and CDB3, which
possess all the concepts (Step 3). The mappings between the concepts in the
original query and their representations in the relevant component databases
are generated (Step 4) and shown in figure 7.

Fig. 7. The mappings between concepts and database objects for the example
scenario

In this example scenario, CDB3 uses three attributes, street, city, and
state, for information on address, while CDB1 uses one attribute, address.
This is a case of AASC. EASC occurs where CDB1 uses an attribute, address,
in the Student entity to represent the student's address, while CDB2
represents the same information in the Address entity. As CDB3 uses
the FemaleStudent entity for female students, while the same information on
sex is represented as values of the sex attribute in CDB1, EVSC
also occurs in the example scenario.

Now, the Sub-query Generator re-formulates the original query into three
sub-queries for CDB1, CDB2 and CDB3 according to the mappings (Step 5).
Thus, the sub-query for CDB1 might be:

SELECT name
FROM Undergraduate
WHERE sex = 'female' AND address LIKE '%Seoul%'
UNION
SELECT name
FROM Graduate
WHERE sex = 'female' AND address LIKE '%Seoul%'

The original query might be re-formulated for CDB2:

SELECT Student.nm
FROM Student, Address
WHERE Student.sid = Address.sid AND Address.city = 'Seoul'

and for CDB3:

SELECT name
FROM FemaleStudent
WHERE city = 'Seoul'

After the Sub-query Generator re-formulates the original query into
sub-queries, the Query Distributor sends them to CDB1, CDB2 and CDB3,
respectively (Step 6). The three component databases, CDB1, CDB2 and CDB3,
return the result tuples of the sub-queries to the SemQL Processor (Step 7).
Finally, the Integrator merges the results from the three component databases
and presents the integrated results to the users (Step 8).
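
The example scenario can be replayed on a small scale. The sketch below is an
illustration only, not the SemQL Processor itself: it uses Python's standard
sqlite3 module, builds three in-memory stand-ins for CDB1, CDB2 and CDB3 with
invented sample rows, hard-codes the three sub-queries quoted above (the output
of Step 5), submits them (Steps 6 and 7), and merges the results with a set
union (Step 8). The helper names make_db and run_semantic_query are
hypothetical.

```python
import sqlite3


def make_db(statements):
    """Create an in-memory SQLite database and run the given SQL statements."""
    db = sqlite3.connect(":memory:")
    for statement in statements:
        db.execute(statement)
    db.commit()
    return db


# Invented sample rows; table and column names follow the example scenario.
DATABASES = {
    "CDB1": make_db([
        "CREATE TABLE Undergraduate (name TEXT, sex TEXT, address TEXT)",
        "CREATE TABLE Graduate (name TEXT, sex TEXT, address TEXT)",
        "INSERT INTO Undergraduate VALUES ('Kim', 'female', '12 Main St, Seoul')",
        "INSERT INTO Undergraduate VALUES ('Lee', 'male', '3 Hill Rd, Seoul')",
        "INSERT INTO Graduate VALUES ('Park', 'female', '9 Bay Ave, Busan')",
    ]),
    "CDB2": make_db([
        "CREATE TABLE Student (sid INTEGER, nm TEXT)",
        "CREATE TABLE Address (sid INTEGER, city TEXT)",
        "INSERT INTO Student VALUES (1, 'Choi')",
        "INSERT INTO Student VALUES (2, 'Jung')",
        "INSERT INTO Address VALUES (1, 'Seoul')",
        "INSERT INTO Address VALUES (2, 'Daegu')",
    ]),
    "CDB3": make_db([
        "CREATE TABLE FemaleStudent (name TEXT, street TEXT, city TEXT, state TEXT)",
        "INSERT INTO FemaleStudent VALUES ('Yoon', '5 River Rd', 'Seoul', 'Seoul')",
        "INSERT INTO FemaleStudent VALUES ('Han', '8 Port St', 'Incheon', 'Incheon')",
    ]),
}

# The three sub-queries from the text, one per component database.
SUBQUERIES = {
    "CDB1": (
        "SELECT name FROM Undergraduate "
        "WHERE sex = 'female' AND address LIKE '%Seoul%' "
        "UNION "
        "SELECT name FROM Graduate "
        "WHERE sex = 'female' AND address LIKE '%Seoul%'"
    ),
    "CDB2": (
        "SELECT Student.nm FROM Student, Address "
        "WHERE Student.sid = Address.sid AND Address.city = 'Seoul'"
    ),
    "CDB3": "SELECT name FROM FemaleStudent WHERE city = 'Seoul'",
}


def run_semantic_query(databases, subqueries):
    """Send each sub-query to its database and merge the returned names."""
    merged = set()
    for db_name, db in databases.items():
        for (value,) in db.execute(subqueries[db_name]):
            merged.add(value)
    return sorted(merged)


if __name__ == "__main__":
    print(run_semantic_query(DATABASES, SUBQUERIES))
```

A set union is the simplest possible Integrator; a real one would also have to
reconcile duplicate entities that appear under different names in different
databases.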

5 Related Works 

Early research on semantic heterogeneity in multidatabase systems focused
on procedures to merge individual component database schemas into a single
global schema. A global schema multidatabase presents a single, integrated
global view to the user and provides a simple and effective paradigm. However,
creating and maintaining the global schema is difficult. Multidatabase languages
are an attempt to resolve some of the problems associated with a global schema.
The multidatabase language approach eliminates the problems of global schema
creation and maintenance, but presents a more complex global interface to the
user.

In several studies [4] [8], new approaches have been developed for integrating
information in dynamic and open environments using new technological
developments such as agent technology, domain ontologies, intelligent mediators,
and high-level query languages. These approaches were designed to support
flexibility and openness. A common assumption of these dynamic approaches is
that users already possess the knowledge needed to integrate information, which
might be a burden to the users.

Recent advances in online dictionaries and thesauruses make it possible to
apply linguistic theory in an automated fashion, which enables users to
integrate information more comfortably. The Summary Schemas Model (SSM)
has been proposed as an extension to multidatabase systems to aid in semantic
identification [2]. The system uses a global data structure to match the user's
terms to the semantically closest available system terms. However, this approach
tends to centralize the search within a single logical index, thereby
introducing performance limitations for large networks.

As linguistic theories evolved in recent decades, linguists became increasingly 
explicit about the information a lexicon must contain in order for the phono- 
logical, syntactic, and lexical components to work together in the everyday pro- 
duction and comprehension of linguistic messages [6]. WordNet is an electronic 
lexical system developed at Princeton University. Several approaches [3] [5] use 
WordNet as knowledge about the semantic contents of images to improve re- 
trieval effectiveness. In particular, [3] uses WordNet for query and database 
expansion. 

6 Summary and Conclusion

Using WordNet as linguistic knowledge, we have suggested a method for knowl- 
edge sharing among multiple databases. From the descriptions of database ob- 
jects, we construct a semantic network to represent the meaning of the infor- 
mation in a component database. In merging semantic networks into a global 
semantic network, we can interpret the information in the same way as intended 
by a component database. A global semantic network provides a multidatabase
system with necessary knowledge for integrated access to component databases,
such as access knowledge and semantic knowledge. With the global semantic 
network and a semantic query language, SemQL, we give users an efficient and 
effective access mechanism to integrated information. SemQL captures the con- 
cepts about what users want, which enables users to issue queries to a large 
number of autonomous databases without knowledge of their schemas. 

As it is not possible for any system to capture semantics without human
interaction, creating a semantic network requires some initial input from users.
The users' descriptions of database objects are therefore crucial to our
approach. Weak or wrong descriptions may degrade the information integration
system, which is a drawback of our approach. However, if users write
descriptions according to our guidelines, the information integration system
lets users focus on specifying what they want, rather than thinking about how
to obtain the information. That is, our approach frees users from the
tremendous tasks of finding the relevant information sources and interacting
with each information source through a particular interface. Furthermore, in
our approach, each information source needs no knowledge of other information
sources for information integration, which reduces the burden on each
information source.

Abstract. Data, information and knowledge are all represented in a single
formalism as “items”. Items contain two types of acceptability measures that
measure the invalidity of item instances. Objects are item building operators
that also contain two types of acceptability measures. These acceptability
measures define a graduated acceptability region for data, information and 
knowledge. This region represents ‘just invalid’ knowledge. A quantitative 
calculus estimates the extent to which a knowledge base may be expected to 
extend into this region as time passes. This calculus is simplified by the use of 
the unified knowledge representation. A single rule of knowledge 
decomposition simplifies the structure of the conceptual model. Expressions 
in this calculus are simplified if the knowledge has been decomposed. 

Keywords: expert systems, knowledge representation. 



1. Introduction 

The terms ‘data’, ‘information’ and ‘knowledge’ are used here in a rather idiosyncratic 
sense [1]. The data in an application are those things that can be represented as 
simple constants or variables. The information is those things that can be represented
as tuples or relations. The knowledge is those things that can be represented either as 
programs in an imperative language or as rules in a declarative language. A unified 
knowledge representation for conceptual modelling is described in [2]. That 
representation is unified in the sense that no distinction is made between the 
knowledge, information and data throughout the design process. A conceptual model 
is expressed in terms of “items” and “objects” [3]; objects are item-building operators. 
A single rule for “knowledge decomposition” [4] simplifies the maintenance of the 
conceptual model and its implementation. Classical database normalisation [5] is a 
special case of that single rule. The conceptual model reported in [2] is extended here 
to describe ‘just invalid’ knowledge. This is achieved by introducing fuzzy functions 
to describe a graduated region of varying degrees of acceptability, and by introducing a 
calculus that estimates the extent to which a knowledge base may decay and extend 
into this graduated region as time passes. These fuzzy functions are generalisations of 
the knowledge constraints described in [6]. 

Approaches to the preservation of knowledge base integrity either aim to design 
the knowledge base so that it is inherently maintainable [2] or to present the 
knowledge in a form that discourages the introduction of inconsistencies [7].

Fig. 1. Knowledge base, constraint domain and acceptability region: (a) top
view; (b) side view

Constraints are seldom mentioned in connection with knowledge; they can play a 
useful role in preserving knowledge base integrity. An approach to specifying 
knowledge constraints is described in [6]. Those constraints are two-valued in the 
sense that either they are satisfied or they aren’t. The constraint domain is the union 
of all knowledge base instances that satisfy a given set of knowledge constraints. The 
constraint domain can be visualised as an area within which a given knowledge base 
should reside, and outside which the integrity of chunks of knowledge should be 
questioned. 

The two-valued division of knowledge by the constraint domain is too crude to
describe ‘just invalid’ knowledge. But an “acceptability region” in which knowledge
is “more acceptable” the “closer” it is to the constraint domain could make such a
claim. Acceptability is defined in that sense here. A knowledge base’s acceptability
region is a graduated region which raises questions of integrity at differing degrees of
confidence. This acceptability region may be visualised as “a halo surrounding” the
knowledge base. This is illustrated in Fig. 1 where in (b) “1” means “true” and “0”
means “false”.

A calculus estimates the effects of knowledge decay. The definition of this 
calculus is simplified by the unified knowledge representation. Expressions in this 
calculus are simplified if the knowledge has been decomposed. The work described 
here has a rigorous, formal theoretical basis expressed in terms of the λ-calculus; the
work may also be presented informally in terms of schema. Schema are used when
the methodology is applied in practice. 



[part/sale-price, part/cost-price, mark-up]

part/sale-price                  part/cost-price
part-number   dollar-amount      part-number   dollar-amount
1234          1.48               1234          1.23
2468          2.81               2468          2.34
3579          4.14               3579          3.45
8642          5.47               8642          4.56
7531          6.80               7531          5.67
1470          8.14               1470          6.78

Fig. 2. Value set of the item [part/sale-price, part/cost-price, mark-up]

2. Items and Objects

Items are a formalism for describing the things in an application; they have a uniform 
format no matter whether they represent data, information or knowledge things [2]. 
The notion of an item is extended here to incorporate two classes of acceptability 
measures. The key to this formalism is the way in which the “meaning” of an item, 
called its “semantics”, is specified. A single rule of decomposition is specified for 
items [8]. Items are either represented informally as “i-schema” or formally as
λ-calculus expressions. The i-schema notation is used in applications.



2.1 The Value Set 

The semantics of an item is a function that recognises the members of the “value set” 
of that item. The value set of an information item is the set of tuples that are 
associated with a relational implementation of that item. Knowledge items, including 
complex, recursive knowledge items, have value sets too [2]. For example, the item, 
which represents the rule “the sale price of parts is the cost price marked up by a 
universal mark-up factor”, could have a value set as shown in Fig. 2. 

The value set of an item will change over time τ, but an item’s semantics should
remain constant. The value set of a knowledge item at a certain time τ is a (possibly
infinite) set of tuples such as the set illustrated in Fig. 2.



2.2 Item Acceptability Measures 



Formally, given a unique name $A$ and an $n$-tuple $(m_1, m_2, \ldots, m_n)$
with $M = \sum_i m_i$, if:

• $S_A$ is an $M$-argument expression of the form:

\[ \lambda y^1_1 \ldots y^1_{m_1} \ldots y^n_{m_n} \bullet [\, S_{A_1}(y^1_1, \ldots, y^1_{m_1}) \wedge \cdots \wedge S_{A_n}(y^n_1, \ldots, y^n_{m_n}) \wedge J(y^1_1, \ldots, y^n_{m_n}) \,] \bullet \]

where $\{A_1, \ldots, A_n\}$ is an ordered set of not necessarily distinct items;
each item in this set is called a component of item $A$.

• $V_A$ is an $M$-argument fuzzy expression [9] of the form:

\[ \lambda y^1_1 \ldots y^1_{m_1} \ldots y^n_{m_n} \bullet [\, V_{A_1}(y^1_1, \ldots, y^1_{m_1}) \wedge \cdots \wedge V_{A_n}(y^n_1, \ldots, y^n_{m_n}) \wedge K(y^1_1, \ldots, y^n_{m_n}) \,] \bullet \]

where $\{A_1, \ldots, A_n\}$ are the components of item $A$, $K$ is a fuzzy
predicate and $\wedge$ is the fuzzy “min” conjunction.

• $C_A$ is a fuzzy expression of the form:

\[ [\, C_{A_1} \wedge C_{A_2} \wedge \cdots \wedge C_{A_n} \wedge (L)_A \,] \]

where $\wedge$ is the fuzzy “min” conjunction and $L$ is a fuzzy expression
constructed as a logical combination of:

  • $Card_A$ lies in some numerical range;

  • $Uni(A_i)$ for some $i$, $1 \le i \le n$; and

  • $Can(A_i, X)$ for some $i$, $1 \le i \le n$, where $X$ is a non-empty subset of
    $\{A_1, \ldots, A_n\} - \{A_i\}$;

subscripted with the name of the item $A$,

then the named triple $A[S_A, V_A, C_A]$ is an $M$-adic item with item name $A$;
$S_A$ is called the item semantics of $A$, $V_A$ is called the item value
acceptability measure of $A$ and $C_A$ is called the item set acceptability
measure of $A$. “$Uni(A_i)$” is a fuzzy predicate whose truth value is “the
proportion of the members of the value set of item $A_i$ that also occur in the
value set of item $A$”. “$Can(A_i, X)$” is a fuzzy predicate whose truth value is
“in the value set of item $A$, the proportion of members of the value set of the
set of items $X$ that functionally determine members of the value set of item
$A_i$”. “$Card_A$” means “the number of different values in the value set of item
$A$”. The subscripts identify the item’s components to which that measure applies.
Given an item $A$ and tuple $X$, if $V_A(X) < 1$ but ‘close’ to 1 then $X$ is just
invalid in the value set of $A$. Also, if $C_A < 1$ but ‘close’ to 1 then $A$ is
just invalid.

For example, an application may contain an association whereby each part is
associated with a cost-price. This association could be represented by the information
item named part/cost-price; the λ-calculus form for this item is:

\[ \text{part/cost-price}[\; \lambda xy \bullet [\, S_{part}(x) \wedge S_{cost\text{-}price}(y) \wedge costs(x, y) \,] \bullet, \]
\[ \lambda xy \bullet [\, V_{part}(x) \wedge V_{cost\text{-}price}(y) \wedge K_1(x, y) \,] \bullet, \]
\[ [\, C_{part} \wedge C_{cost\text{-}price} \wedge (L_1)_{part/cost\text{-}price} \,] \;] \]

for some fuzzy predicates $K_1$ and $L_1$, where costs(x, y) is a first-order
predicate that means “x costs y”. Rules, or knowledge, can also be defined as items.
For example, the semantics of the [part/sale-price, part/cost-price, mark-up] item is:

\[ \lambda x_1 x_2 y_1 y_2 z \bullet [\, S_{part/sale\text{-}price}(x_1, x_2) \wedge S_{part/cost\text{-}price}(y_1, y_2) \wedge S_{mark\text{-}up}(z) \wedge ((x_1 = y_1) \rightarrow (x_2 = z \times y_2)) \,] \bullet \]

The semantics of an item is a function that recognises the members of that item’s 
value set. The value acceptability measure of an item is a fuzzy estimate of the 
likelihood that a given tuple is not in the item’s value set. The value acceptability 
measure does not attempt to estimate the likelihood that a tuple is in the item’s value 
set as that task is performed by the item’s semantics. The set acceptability measure 
of an item is a fuzzy estimate of the likelihood that the general structure of the item’s 
value set is invalid. So an item’s semantics specifies what should be in an 
implementation, and the two acceptability measures are measures of invalidity of what 
is in an implementation. If an acceptability measure does not detect invalidity to any 
level of (fuzzy) significance then this does not imply that an implementation is valid. 

For example, an application could contain a spare-part thing that is represented by
the item part. If spare parts are identified by their part-number then the semantics of
part is $\lambda x \bullet [\text{is-a}[x{:}\text{part-number}]] \bullet$, where
is-a[x:P] means “x if x is in P” and “undefined otherwise”. Suppose that the
generally expected range for part numbers is [0, 2 000]. Then the value
acceptability measure for the item part could be $\lambda x \bullet [f(x)] \bullet$
where:

\[ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } 0 \le x \le 2000 \\ 2 - \dfrac{x}{2000} & \text{if } 2000 < x \le 4000 \\ 0 & \text{if } x > 4000 \end{cases} \]

$Card_{part}$ is the number of different part-numbers. Suppose that the generally
expected range for $Card_{part}$ is less than 100. Then the set acceptability
measure, $C_{part}$, for the item part is a fuzzy predicate; it could be:

\[ C_{part} = \begin{cases} 1 & \text{if } Card_{part} \le 100 \\ 2 - \dfrac{Card_{part}}{100} & \text{if } 100 < Card_{part} < 200 \\ 0 & \text{if } Card_{part} \ge 200 \end{cases} \]
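
The two measures for the item part can be evaluated directly. The following is
a minimal sketch, assuming the piecewise definitions given above with the
breakpoints 2 000 and 4 000 for part numbers and 100 and 200 for the
cardinality; the function names are hypothetical.

```python
def value_acceptability_part(x: float) -> float:
    """Fuzzy value acceptability f(x) of a part number x.

    1.0 inside the expected range [0, 2000], falling linearly to 0.0 at 4000.
    """
    if x < 0:
        return 0.0
    if x <= 2000:
        return 1.0
    if x <= 4000:
        return 2.0 - x / 2000.0
    return 0.0


def set_acceptability_part(card: int) -> float:
    """Fuzzy set acceptability C_part given the number of distinct part numbers."""
    if card <= 100:
        return 1.0
    if card < 200:
        return 2.0 - card / 100.0
    return 0.0


if __name__ == "__main__":
    for part_number in (1234, 3000, 4567):
        print(part_number, value_acceptability_part(part_number))
    print(set_acceptability_part(150))
```

A part number such as 2 100 scores just below 1 and so falls in the graduated
region that the text calls ‘just invalid’.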



In the part/cost-price example above, a simple expression for $K_1$ could be:

\[ K_1(x, y) = \begin{cases} 0 & \text{if } 0 \le x \le 2000 \text{ and } y > 50 \\ 1 & \text{otherwise} \end{cases} \]

and $L_1$ could be Can(cost-price, {part}), i.e. the proportion of part numbers that
functionally determine their cost price in the association part/cost-price. In this
way, value and set acceptability measures are developed for data items, and
similarly for information and knowledge items.
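
One way to compute a Can measure over a stored relation is sketched below. It
rests on an assumed reading of the definition: a determinant value counts as
“functionally determining” the dependent component when it is associated with
exactly one dependent value. The function name can_proportion and the sample
rows are hypothetical.

```python
from collections import defaultdict


def can_proportion(rows, determinant_index, dependent_index):
    """Proportion of determinant values that map to exactly one dependent value.

    rows is an iterable of tuples; the two indexes select the determinant and
    the dependent component within each tuple.
    """
    dependents = defaultdict(set)
    for row in rows:
        dependents[row[determinant_index]].add(row[dependent_index])
    if not dependents:
        return 1.0  # assumption: an empty relation violates nothing
    determined = sum(1 for values in dependents.values() if len(values) == 1)
    return determined / len(dependents)


if __name__ == "__main__":
    # Invented part/cost-price rows; part 3579 has two conflicting prices.
    part_cost_price = [
        (1234, 1.23),
        (2468, 2.34),
        (3579, 3.45),
        (3579, 3.99),
    ]
    print(can_proportion(part_cost_price, 0, 1))
```

Here two of the three part numbers determine a single cost price, so the
measure is two thirds, which signals that the relation is no longer a clean
function from parts to prices.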

Items make it difficult to analyse the structure of the whole application because, 
for example, two rules that share the same basic wisdom may be expressed in terms of 
quite different components; this could obscure their common wisdom. To make the 
inherent structure of knowledge clear ‘objects’ are introduced as item building 
operators [2]. 



2.3 Objects 

Object names are written in bold italics. Suppose that the conceptual model already
contains the item “part” which represents spare parts, and the item “cost-price” which
represents cost prices; then the information “spare parts have a cost price” can be
represented by “part/cost-price” which may be built by applying the “costs” object to
part and cost-price:

part/cost-price = costs(part, cost-price)

Suppose that the conceptual model already contains the item “part/sale-price” which
represents the association between spare parts and their corresponding selling price,
and the item “mark-up” which represents the data thing a universal mark-up factor;
then the rule “spare parts are marked up by a universal mark-up factor” can be
represented by [part/sale-price, part/cost-price, mark-up] which is built by applying
the “mark-up-rule” object to the items “part/sale-price”, “part/cost-price” and
“mark-up”:



[part/sale-price, part/cost-price, mark-up] = 

mark-up-rule(part/sale-price, part/cost-price, mark-up)

The conceptual model contains items. A fundamental set of data items in the
conceptual model is called the “basis”. The remaining items in the conceptual model
are built by applying object operators to the other items in the conceptual model. As
for items, objects may either be represented informally as “o-schema” or formally as
typed λ-calculus expressions.



2.4 Decomposition 

In [8] the decomposition of items and objects is described. Decomposition removes 
hidden relationships from the conceptual model. Hidden relationships can present a 
maintenance hazard. Here decomposition simplifies the estimation of knowledge base 
decay. 

Item join provides the basis for item decomposition. Given the items A and B,
the item with name $A \otimes_E B$ is called the join of A and B on E, where E is a
set of components common to both A and B [8]. Using the rule of composition
$\otimes$, knowledge items, information items and data items may be joined with one
another regardless of type. For example, the knowledge item:

\[ [\text{cost-price, tax}][\; \lambda xy \bullet [\, S_{cost\text{-}price}(x) \wedge S_{tax}(y) \wedge (y = x \times 0.05) \,] \bullet, \]
\[ \lambda xy \bullet [\, V_{cost\text{-}price}(x) \wedge V_{tax}(y) \wedge K_2(x, y) \,] \bullet, \]
\[ [\, C_{cost\text{-}price} \wedge C_{tax} \wedge (L_2)_{[cost\text{-}price,\ tax]} \,] \;] \]

for some $K_2$ and $L_2$ can be joined with the information item part/cost-price on
the set {cost-price} to give the information item part/cost-price/tax. In other words:

\[ [\text{cost-price, tax}] \otimes_{\{cost\text{-}price\}} \text{part/cost-price} = \]
\[ \text{part/cost-price/tax}[\; \lambda xyz \bullet [\, S_{part}(x) \wedge S_{cost\text{-}price}(y) \wedge S_{tax}(z) \wedge costs(x, y) \wedge (z = y \times 0.05) \,] \bullet, \]
\[ \lambda xyz \bullet [\, V_{part}(x) \wedge V_{cost\text{-}price}(y) \wedge V_{tax}(z) \wedge K_1(x, y) \wedge K_2(y, z) \,] \bullet, \]
\[ [\, C_{part} \wedge C_{cost\text{-}price} \wedge C_{tax} \wedge (L_1 \wedge L_2)_{part/cost\text{-}price/tax} \,] \;] \]

In this way items may be joined together to form more complex items.
Alternatively, the $\otimes$ operator may form the basis of a theory of
decomposition in which each item may be replaced by a set of simpler items. An
item $I$ is decomposable into the set of items $D = \{I_1, I_2, \ldots, I_n\}$ if:
$I_i$ has non-trivial semantics for all $i$, and
$I = I_1 \otimes I_2 \otimes \cdots \otimes I_n$, where each join is monotonic; that
is, each term in this composition contributes at least one component to $I$ [8].
If item $I$ is decomposable then it will not necessarily have a unique
decomposition.
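
The semantics of the joined item can be read as a recogniser built from the
recognisers of its two parts. The sketch below illustrates only that idea: it
keeps the predicates costs(x, y) and z = y × 0.05 from the example and omits
the component semantics for part, cost-price and tax as well as both
acceptability measures. The stored rows and all function names are
hypothetical.

```python
from decimal import Decimal

# Invented stored rows for the information item part/cost-price.
COSTS = {
    ("1234", Decimal("1.23")),
    ("2468", Decimal("2.34")),
}


def costs(x, y):
    """Recognise pairs (part, cost-price) that belong to part/cost-price."""
    return (x, y) in COSTS


def tax_rule(y, z):
    """Recognise pairs (cost-price, tax) that satisfy tax = cost-price x 0.05."""
    return z == y * Decimal("0.05")


def part_cost_price_tax(x, y, z):
    """Recogniser for the join of the two items on the shared cost-price y."""
    return costs(x, y) and tax_rule(y, z)


if __name__ == "__main__":
    print(part_cost_price_tax("1234", Decimal("1.23"), Decimal("0.0615")))
    print(part_cost_price_tax("1234", Decimal("1.23"), Decimal("0.07")))
```

The shared argument y plays the role of the join set {cost-price}: both
component recognisers must accept the same cost price for a triple to be in the
value set of the joined item.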



3. Estimating Knowledge Base Validity 

The semantics of an item $A[S_A, V_A, C_A]$ is a function that recognises the
members of its value set. The value set is a conceptual notion in the system design.
So the value set of the item A (as in the definition of an M-adic item above) at
time τ is:

\[ \Gamma^{\tau}(A) = \{\, y : S_A(y) \text{ at time } \tau \,\} \]

Fig. 3. Real and virtual items

A knowledge base implementation is a set of knowledge items and a set of stored
relations and data domains representing some information and data items. Some
information and data items are associated with actual stored data, and some are not.
Knowledge items are not normally associated with actual stored data. If an item is
associated with actual stored data then it is a real item; otherwise it is a virtual item.

The set of tuples in the implementation of the real item A is denoted by
$\lambda^{\alpha}(A)$ where α is the time of the most recent modification to those
tuples. Knowledge items may be used to derive tuples for virtual data and
information items. For example, suppose that the real data item mark-up has a
stored data value mark-up, and that the real information item part/cost-price has
a stored relation part/cost-price. Then the knowledge item
[part/sale-price, part/cost-price, mark-up], or an “if-then” implementation of it,
may be used to derive tuples in the relation for the virtual item
part/sale-price. Further, the knowledge item [part/sale-price, part/tax-payable,
tax-rate] could then enable the tuples in the relation for the virtual item
part/tax-payable to be derived. This is illustrated in Fig. 3. If a virtual item
$A_i$ is a component of a knowledge item A where the tuples (or data values)
associated with $A_i$ are derived from
$\{\lambda^{\alpha_j}(A_j) : A_j \text{ is a component of } A,\ j \ne i\}$ using the
knowledge A then $A_i$ is derivable and those tuples (or data values) are called the
derived set which is denoted by $\lambda^{\beta}(A_i)$ where β is the time at which
the derivation is performed. This definition is recursive. In Fig. 3 only
part/cost-price, mark-up and tax-rate are stored. This example shows how the
validity of the calculation of the tuples in the relation for the virtual item
part/tax-payable relies on the validity of those three real items and on the
validity of those two knowledge items. So if the validity of any of those three
real items or either of those two knowledge items has “decayed” in some way then
that calculation may yield an incorrect result.
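
Deriving the tuples of a virtual item from real items can be sketched in a few
lines. The cost prices below are those listed in Fig. 2. The mark-up factor of
1.2 and the rounding to whole cents are assumptions, chosen because together
they reproduce the sale prices listed in Fig. 2; the function name is
hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Real data item mark-up: assumed value, not stated in the text.
MARK_UP = Decimal("1.2")

# Real information item part/cost-price, as listed in Fig. 2.
PART_COST_PRICE = {
    "1234": Decimal("1.23"),
    "2468": Decimal("2.34"),
    "3579": Decimal("3.45"),
    "8642": Decimal("4.56"),
    "7531": Decimal("5.67"),
    "1470": Decimal("6.78"),
}


def derive_part_sale_price(part_cost_price, mark_up):
    """Derive the virtual item part/sale-price with an if-then mark-up rule."""
    cent = Decimal("0.01")
    return {
        part: (cost * mark_up).quantize(cent, rounding=ROUND_HALF_UP)
        for part, cost in part_cost_price.items()
    }


if __name__ == "__main__":
    for part, price in derive_part_sale_price(PART_COST_PRICE, MARK_UP).items():
        print(part, price)
```

If either the stored cost prices or the mark-up value is out of date when the
function is called, every derived sale price inherits the error, which is the
dependency that the paragraph above describes.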

For a knowledge base, its implementation may not be valid because updates that
should have been performed were not, or modifications that should not have been
performed were. The corruption of a knowledge base by such modifications is not
considered here. The failure to perform updates is considered. So updates are changes
that should have been performed on the implementation of real items or knowledge
items. In addition, incorrect values may be attributed to a knowledge base
implementation because the derived tuples for some virtual items were calculated prior
to required updates being performed [10]. At time α the true set for a (real or
virtual) item A is the set of tuples that should be associated with A at time α; it is
denoted by $\Lambda^{\alpha}(A)$. In other words, the implementation is what is
either stored or derived in the knowledge base; the true set is what should be
either stored or derived.

Fig. 4. The implementation and the true set

Suppose that the implementation of the real data item part was stored at time α.
Then at a subsequent time τ the implementation and the true set may be as shown on
the left side of Fig. 4. In that Figure the implementation for the data item part
contains the part number “4567” that should not be there, and does not contain two
part numbers which should be there. Suppose that the implementation for the real
information item part/cost-price was stored at time α. Then at a subsequent time τ
the implementation and the true set may be as shown on the right side of Fig. 4.
Likewise the implementation and the true set of a knowledge item may contain tuples
that should not be there, and may not contain tuples that should be there.

Fig. 3 shows a knowledge base implementation consisting of two knowledge 
items, three information items and two data items. If the design and implementation 
are correct then the tuples associated with each real or virtual item will be the same as 
that item’s true set. 



3.1 Item and Object Validity 

At time τ, the true set $\Lambda^{\tau}(A)$ and the implementation
$\lambda^{\alpha}(A)$, $\alpha < \tau$, of item A may not be the same.
The implementation of a real item is “correct” as long as its tuples have been
correctly stored and maintained [11]. The derived set of a virtual item is “correct” as
long as the knowledge used to derive the tuples for that item has been correctly
maintained and the stored data used by that knowledge has been correctly maintained.
In reality we may hope that the implementation is correct, and expect that it is
incorrect. To measure the extent to which the implementation or the derived set is the
same as the true set, let $p_{X/Y}$ be the proportion of those elements in set X that
are also in set Y. Then the difference measure:

\[ \Delta(A, B) = \sqrt{p_{A/B} \times p_{B/A}} \]

is unity if both sets are identical and is zero if one set contains no elements from the
other; the square root ensures that this measure retains linearity with measured
proportion. For example, if each set contains exactly half of the members of the
other set then the value of this measure is 0.5. The value of the difference measure is
not necessarily equal to the proportion of valid members in either set because the
difference measure takes account of those elements that are not in each set but should
be there. The validity of a real or virtual item A at time τ is:

\[ \delta_A(\tau) = \Delta(\Lambda^{\tau}(A), \lambda^{\alpha}(A)) \]

where $\tau \ge \alpha$ and the difference measure Δ is as defined above;
$0 \le \delta_A(\tau) \le 1$. If $\delta_A(\tau) = 1$ then item A is valid. If
$\delta_A(\tau) = 0$ then the set of tuples associated
with item A contains no tuples that it should contain and A is completely invalid.
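
The difference measure is easy to compute for finite sets. The following is a
minimal sketch; the function names are hypothetical, and the treatment of an
empty set (proportion 0) is an assumption, since the text does not cover that
case.

```python
from math import sqrt


def proportion(x: set, y: set) -> float:
    """p_{X/Y}: the proportion of the elements of X that are also in Y."""
    if not x:
        return 0.0  # assumption: an empty set shares nothing with Y
    return len(x & y) / len(x)


def difference_measure(a: set, b: set) -> float:
    """Square root of the product of the two overlap proportions."""
    return sqrt(proportion(a, b) * proportion(b, a))


if __name__ == "__main__":
    true_set = {1, 2, 3, 4}
    implementation = {3, 4, 5, 6}
    # Each set holds exactly half of the members of the other.
    print(difference_measure(true_set, implementation))
```

Passing the true set and the implementation of an item as the two arguments
gives that item's validity at the time the sets were sampled.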

The knowledge item [part/sale-price, part/cost-price, mark-up] is built by 
applying the knowledge object mark-up-rule to the three items part/sale-price, 
part/cost-price and mark-up. That knowledge item may be used to derive tuples for 
the information item part/sale-price. The accuracy of these derived tuples relies on the 
implementation of the two real items mark-up and part/cost-price being accurate and 
on the knowledge object mark-up-rule being accurate. The integrity of these items
and objects is expected to decay as time progresses. 

If we know precisely what knowledge decay has occurred then we can usually
rectify it [12]. In practice we tend to have some loose expectation ε of the validity δ.
For example, we may expect that “within a year the whole part/cost-price relation will
be out of date”. So our expectation for the validity of the part/cost-price item may be
represented by a function with a linear decay of one year’s extent. Also, we may
expect that “as the ‘types’ of parts are redesignated, the contents of the part/type
relation will decay decreasingly over time so that in a year roughly half of the relation
will be out of date and in a ‘very long time’ the whole relation will be out of date”.
So our expectation for the validity of the part/type item may be represented by a
function with an exponential decay with a half-life of one year. The validity estimates
for these two examples are:

\[ \varepsilon_{part/cost\text{-}price}(\tau) = \begin{cases} 1 - \tau & \text{if } 0 \le \tau \le 1 \\ 0 & \text{otherwise} \end{cases} \]

\[ \varepsilon_{part/type}(\tau) = 2^{-\tau} \quad \text{for } \tau \ge 0 \]
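
The two expectation functions can be written down directly, with time measured
in years since the item was last brought up to date. This is a minimal sketch;
the function names and the extent and half_life parameters are hypothetical
generalisations of the one-year figures used in the text.

```python
def linear_decay(tau: float, extent: float = 1.0) -> float:
    """Expected validity falling linearly from 1 to 0 over `extent` years."""
    if 0 <= tau <= extent:
        return 1.0 - tau / extent
    return 0.0


def half_life_decay(tau: float, half_life: float = 1.0) -> float:
    """Expected validity halving every `half_life` years, for tau >= 0."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return 2.0 ** (-tau / half_life)


if __name__ == "__main__":
    for years in (0.0, 0.5, 1.0, 2.0):
        print(years, linear_decay(years), half_life_decay(years))
```

After two years the linear estimate for part/cost-price has reached zero while
the half-life estimate for part/type is still one quarter, which matches the
‘very long time’ wording in the second expectation.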

The validity of an object will contribute to the validity of any item generated by 
that object. But objects do not have a value set or an implementation. The validity 
of the item A, generated using object B, A = B(C, D, E, F) may be analysed
completely in terms of the validity of items C, D, E and F and the validity of object 
B. In other words, if items C, D, E and F are valid then the validity of item A will 
be attributable entirely to the validity of object B. So the validity of an object is the 
validity of any item generated by that object when applied to a set of valid items.

3.2 Propagating Validity Estimates

It would be convenient if:

\[ \varepsilon_{ob(cc,\, dd)}(\tau) = \varepsilon_{ob}(\tau) \times \varepsilon_{cc}(\tau) \times \varepsilon_{dd}(\tau) \]

but this product rule is not valid in general because the validity of the components
(in the above example the components are cc and dd) may be logically dependent.
This product rule is valid if the validity of the object ob and the validity of the two
component items are all independent. If the validity of X is independent of the
validity of Y then $(\varepsilon_X(\tau) \mid \varepsilon_Y(\tau)) = \varepsilon_X(\tau)$.
If the validity of X is determined by the validity of Y then
$(\varepsilon_X(\tau) \mid \varepsilon_Y(\tau)) = \varepsilon_Y(\tau)$.

Suppose that object ob is applied to a set of n component items C = {cc, D} 
where cc is a component item and D is a set of n - 1 component items. The general 
rule for propagating validity estimates through an object operator is: 

\[ \models \varepsilon_{ob(C)}(\tau) = \varepsilon_{ob}(\tau) \times ( \varepsilon_{C}(\tau) \mid \varepsilon_{ob}(\tau) ) = \varepsilon_{C}(\tau) \times ( \varepsilon_{ob}(\tau) \mid \varepsilon_{C}(\tau) ) \]

where $( \varepsilon_{C}(\tau) \mid \varepsilon_{ob}(\tau) )$ is “an estimate of the validity of the set of n component 
items C at time $\tau$ given that the estimate of the validity of object ob at time $\tau$ is 
$\varepsilon_{ob}(\tau)$”. If the conceptual model has been decomposed then $( \varepsilon_{C}(\tau) \mid \varepsilon_{ob}(\tau) ) = \varepsilon_{C}(\tau)$. 
To propagate validity estimates across a set of component items, suppose that 
the set C is a set of n component items as above. The general recursive rule for 
propagating validity estimates over such a set is: 

\[ \models \varepsilon_{\{cc, D\}}(\tau) = \varepsilon_{cc}(\tau) \times ( \varepsilon_{D}(\tau) \mid \varepsilon_{cc}(\tau) ) \]

where $( \varepsilon_{D}(\tau) \mid \varepsilon_{cc}(\tau) )$ is “an estimate of the validity of the set of n - 1 
component items D at time $\tau$ given that the estimate of the validity of the component 
item cc at time $\tau$ is $\varepsilon_{cc}(\tau)$”. In general $( \varepsilon_{D}(\tau) \mid \varepsilon_{cc}(\tau) ) \ne \varepsilon_{D}(\tau)$ even if the 
conceptual model has been decomposed. Now suppose that item cc is virtual and that 
it is derivable from the set of items D using knowledge item ob( cc, D ), i.e. that 
Can( cc, D ) = 1 in ob( cc, D ). Then the validity of item cc depends on the 
validity of object ob and on the validity of the items in the set D. Estimates of these 
validities are propagated by: 

\[ \models \varepsilon_{cc}(\tau) = \varepsilon_{ob}(\tau) \times ( \varepsilon_{D}(\tau) \mid \varepsilon_{ob}(\tau) ) \]

If the conceptual model has been decomposed then $( \varepsilon_{D}(\tau) \mid \varepsilon_{ob}(\tau) ) = \varepsilon_{D}(\tau)$. 

The three rules above for propagating validity estimates follow from the definition 
of item validity. These three rules lead to complex expressions for validity estimates 
due to the quantity of conditional expressions — ie expressions involving “|”. But the 
quantity of these conditional expressions may be reduced. 

If the knowledge base has been decomposed [8] and if the sub-item relationships 
have been reduced to sub-type relationships between data items then the object 
operators and the basis items in the conceptual model are independent with the 
possible exception of sub-item relationships between data items [13]. So if the 
knowledge base has been decomposed and there are no sub-item relationships between 
data items then all the conditional expressions may be removed [14]. For example, if 
the knowledge base has been decomposed and there are no sub-item relationships 
between data items then the validity estimate for the virtual information item 
part/tax-payable is: 

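\[
\begin{aligned}
\varepsilon_{\text{part/tax-payable}}(\tau)
&= \varepsilon_{\text{tax-rule}}(\tau) \times ( \varepsilon_{\{\text{part/sale-price, tax-rate}\}}(\tau) \mid \varepsilon_{\text{tax-rule}}(\tau) ) \\
&= \varepsilon_{\text{tax-rule}}(\tau) \times \varepsilon_{\{\text{part/sale-price, tax-rate}\}}(\tau) \\
&= \varepsilon_{\text{tax-rule}}(\tau) \times \varepsilon_{\text{part/sale-price}}(\tau) \times ( \varepsilon_{\text{tax-rate}}(\tau) \mid \varepsilon_{\text{part/sale-price}}(\tau) ) \\
&= \varepsilon_{\text{tax-rule}}(\tau) \times \varepsilon_{\text{part/sale-price}}(\tau) \times \varepsilon_{\text{tax-rate}}(\tau) \\
&= \varepsilon_{\text{tax-rule}}(\tau) \times \varepsilon_{\text{mark-up-rule}}(\tau) \times ( \varepsilon_{\{\text{part/cost-price, mark-up}\}}(\tau) \mid \varepsilon_{\text{mark-up-rule}}(\tau) ) \times \varepsilon_{\text{tax-rate}}(\tau) \\
&= \varepsilon_{\text{tax-rule}}(\tau) \times \varepsilon_{\text{mark-up-rule}}(\tau) \times \varepsilon_{\text{part/cost-price}}(\tau) \times \varepsilon_{\text{mark-up}}(\tau) \times \varepsilon_{\text{tax-rate}}(\tau)
\end{aligned}
\]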
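
When every conditional term drops out, the estimate for a virtual item is simply a product over the objects used and the basis items they are applied to. The following sketch illustrates this for part/tax-payable; the function name and the numeric validity values are hypothetical and chosen only for illustration.

```python
from math import prod


def propagate_independent(object_validities, basis_item_validities):
    """Validity estimate of a virtual item when all validities are independent:
    the product of the object validities and the basis item validities."""
    return prod(object_validities) * prod(basis_item_validities)


# Hypothetical validity estimates at one fixed time tau
eps = {
    "tax-rule": 0.9,
    "mark-up-rule": 0.8,
    "part/cost-price": 0.5,
    "mark-up": 1.0,
    "tax-rate": 1.0,
}

estimate = propagate_independent(
    [eps["tax-rule"], eps["mark-up-rule"]],
    [eps["part/cost-price"], eps["mark-up"], eps["tax-rate"]],
)
print(round(estimate, 3))
```

Because the factors multiply, the least valid basis item or rule bounds the estimate for everything derived from it.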

If sub-item relationships are present or if the conceptual model has not been 
decomposed then the calculations become more involved. For example, suppose the 
supervisor data item is a sub-item of the person data item. Consider the real 
person/supervisor item; the implementation of which is populated with 2-tuples 
(person-id, person-id) where the second person is the “supervisor” of the first. 
Suppose that this item is built by applying the super information object to the data 
items person and supervisor: 

person/supervisor = super{ person, supervisor ) 

then the validity estimate of the person/supervisor information item will be: 
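
\[
\begin{aligned}
\varepsilon_{super(person, supervisor)}(\tau)
&= \varepsilon_{\{person, supervisor\}}(\tau) \times ( \varepsilon_{super}(\tau) \mid \varepsilon_{\{person, supervisor\}}(\tau) ) \\
&= \varepsilon_{person}(\tau) \times ( \varepsilon_{supervisor}(\tau) \mid \varepsilon_{person}(\tau) ) \times \varepsilon_{super}(\tau) \\
&= \varepsilon_{person}(\tau) \times \varepsilon_{super}(\tau)
\end{aligned}
\]

where $( \varepsilon_{super}(\tau) \mid \varepsilon_{\{person, supervisor\}}(\tau) ) = \varepsilon_{super}(\tau)$ assuming that the 
knowledge base has been decomposed and the validity of the super operator is 
independent of the validity of the items to which it is applied; and where 
$( \varepsilon_{supervisor}(\tau) \mid \varepsilon_{person}(\tau) ) = \varepsilon_{person}(\tau)$ because the validity of supervisor is 
assumed, quite reasonably, to be determined by the validity of person. 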

4. Conclusion 

A unified knowledge representation has been extended to represent “just invalid” 
knowledge. A quantitative calculus estimates the effect of knowledge decay on 
knowledge base validity. The estimate of validity in the calculus has been simplified 
by the use of the unified knowledge representation. Further simplification has been 
achieved by decomposing the knowledge in the conceptual model. 



References 

1. Tayar, N. “A Model for Developing Large Shared Knowledge Bases” in proceedings 
Second International Conference on Information and Knowledge Management, 
Washington, November 1993, pp717-719. 

2. Debenham, J.K. “Knowledge Engineering”, Springer-Verlag, 1998. 

3. Debenham, J.K. “A Framework For Knowledge Reuse” in proceedings 11th 
International FLAIRS Conference, Florida, May 1998, pp199-203. 

4. Debenham, J.K. “Knowledge Simplification”, in proceedings 9th International 
Symposium on Methodologies for Intelligent Systems ISMIS'96, Zakopane, Poland, 
June 1996. 

5. Date, C.J., “An Introduction to Database Systems” (4th edition) Addison-Wesley, 
1986. 

6. Debenham, J.K. “Constraints for Knowledge Maintenance”, in proceedings AAAI 
Spring Symposium in Artificial Intelligence in Knowledge Management, Stanford, 
California, March 1997. 

7. Kang, B., Gambetta, W. and Compton, P. “Validation and Verification with Ripple 
Down Rules”, International Journal of Human Computer Studies Vol 44 (2) 
pp257-270 (1996). 

8. Debenham, J.K. “Representing Knowledge Normalisation”, in proceedings Tenth 
International Conference on Software Engineering and Knowledge Engineering 
SEKE’98, San Francisco, US, June 1998, pp132-135. 

9. Emerson, E.A. “Temporal and Modal Logic” in Van Leeuwen, J. (Ed) “Handbook of 
Theoretical Computer Science”, pp997-1072, MIT Press, 1994. 

10. Katsuno, H. and Mendelzon, A.O., “On the Difference between Updating a Knowledge 
Base and Revising It”, in proceedings Second International Conference on Principles 
of Knowledge Representation and Reasoning, KR'91, Morgan Kaufmann, 1991. 

11. Debenham, J.K. “From Conceptual Model to Internal Model”, in proceedings Tenth 
International Symposium on Methodologies for Intelligent Systems ISMIS’97, 
Charlotte, October 1997, pp227-236. 

12. Walker, A., Kowalski, R., Lenat, D., Soloway, E. and Stonebraker, M., “Knowledge 
Management”, in (L. Kerschberg, Ed.), “Proceedings from the Second International 
Conference on Expert Database Systems”, Benjamin Cummings, 1989. 

13. Debenham, J.K. “A Unified Approach to Requirements Specification and System 
Analysis in the Design of Knowledge-Based Systems”, in proceedings Seventh 
International Conference on Software Engineering and Knowledge Engineering 
SEKE’95, Washington DC, June 1995, pp144-146. 

14. Coenen F. and Bench-Capon, T. “Building Knowledge Based Systems for 
Maintainability”, in proceedings Third International Conference on Database and 
Expert Systems Applications DEXA’92, Valencia, Spain, September, 1992, 
pp415-420. 




Maximising Expected Utility for Behaviour 
Arbitration 



Julio K. Rosenblatt 



Australian Centre for Field Robotics 
University of Sydney, NSW 2006, Australia 
julio@mech.eng.usyd.edu.au 



Abstract Utility fusion is presented as an alternative means of action 
selection which ameliorates both the bottlenecks of centralised systems and 
the incoherence of distributed systems. In this approach, distributed 
behaviours indicate the utility of possible world states, along with their 
associated uncertainty. A centralised arbiter then combines these utilities 
and probabilities to determine a Pareto-optimal action based on the 
maximisation of expected utility. Utility theory provides a Bayesian 
framework for explicitly representing and reasoning about uncertainty 
within the action selection process. In addition, the construction of a utility 
map allows the arbiter to model and compensate for the dynamics of the 
system; experimental results verify that the resulting system provides 
significantly greater stability. 



1. Introduction 

In unstructured, unknown, and dynamic environments, such as those encountered 
by outdoor mobile robots, an intelligent agent must adequately address the issues 
of incomplete and inaccurate knowledge; it must be able to handle uncertainty in 
both its sensed and a priori information, in the current state of the agent itself, as 
well as in the effects of the agent’s actions. In order to function effectively in 
such conditions, an agent must be responsive to its environment, as well as goal- 
oriented. When used appropriately, deliberative planning and reactive control 
complement each other and compensate for each other’s deficiencies. 

Centralised architectures provide the ability to coherently coordinate multiple 
goals and constraints within a complex environment, while decentralised 
architectures offer the advantages of reactivity, flexibility, and robustness. 
However, sensor fusion creates a bottleneck, and command arbitration runs the 
risk of losing information valuable to the decision-making process; therefore a 
careful balance must be struck between completeness and optimality on the one 
hand versus modularity and efficiency on the other. In addition, it is important to 
consider the agent’s constraints. 

The Distributed Architecture for Mobile Navigation (DAMN) achieves a 
symbiosis of deliberative and reactive elements; it consists of a group of 
distributed behaviours communicating with a centralised command arbiter, as 
shown in Figure 1 [16]. The arbiter is responsible for combining the behaviours’ 
votes to generate the actions sent to the vehicle controller. A mode manager may 
also be used to vary the weights given to each behaviour’s votes during the course 
of a mission. The distributed, 
asynchronous behaviours provide real-time responsiveness to the environment, 
while the centralised command arbitration provides a framework capable of 
producing coherent behaviour. 



[Figure 1 diagram; legible labels: Mode Manager, commands] 




Fig. 1. DAMN framework: centralised arbitration of votes from distributed behaviours 



2. Background 

2.1. Centralised and Behaviour-Based Architectures 

A centralised sensor fusion system uses all available sensory data to create a 
complete model of its environment, plans a series of actions within the context of 
that model, and then executes that plan [10][11][21]. A complete plan to the goal 
may be constructed and followed in an open-loop fashion, or the agent may 
interleave planning and execution; this more closed-loop approach is better able 
to deal with uncertainty and incomplete knowledge. Sensor fusion is able to 
overcome ambiguity and noise inherent in the sensing process [6], but still has the 
disadvantage of creating a computationally expensive bottleneck, and a 
monolithic world model is also more difficult to develop and maintain. In 
addition to introducing delays, a centralised architecture is also more likely to fail 
entirely if any single part of it is not functioning properly, particularly when the 
real world deviates significantly from the models employed. 

As a response to the inefficiencies of centralised systems in controlling 
autonomous robots, a new generation of architectures emerged which were designed in a 
bottom-up fashion to provide greater reactivity to the robot’s surroundings. Rather 
than constructing the system with functional modules such as perception and 
planning, the system is composed of specialised task-achieving modules, or 
behaviours, that operate independently and asynchronously [4]. Each behaviour 
only receives that information specifically required for its task, thus avoiding the 
need for sensor fusion and its inherent bottlenecks. A behaviour encapsulates the 
perception, planning and task execution capabilities necessary to achieve one 
specific aspect of robot control. Behaviour-based systems tend to be reactive, i.e., 
maintain minimal internal state and respond directly to immediate stimuli. Such 
systems perform no lookahead, and do not represent or explicitly deal with 
uncertainty, in the belief that “the world is its own best model” [5], assuming that 
no benefit is to be gained from evidence combination and that all previous sensor 
data should be ignored. 

2.2. Command Fusion 

In behaviour-based architectures which employ priority-based arbitration such as 
the Subsumption Architecture [4], action selection is achieved by assigning 
control to the behaviour with the highest priority while the rest are ignored. This 
provides no means for dealing with multiple goals simultaneously. A compromise 
cannot be achieved in such an all-or-nothing scenario; when a behaviour’s output 
is overridden, that information and knowledge represented is completely lost to 
the system [12]. 

In contrast, command fusion architectures combine commands from various 
behaviours so that decisions are made based on multiple considerations. For 
example, Motor Schemas use potential fields as a framework for command fusion 
[1]; however, it has been shown that potential fields suffer from oscillations and 
instability, and that a robot cannot pass through closely spaced obstacles [3]. In 
addition, vector addition results in an averaged command that may not be 
satisfactory to any of the contributing schemas. The root of these limitations is 
that, as with priority-based arbiters, each behaviour simply outputs a single 
command; in addition, they perform no lookahead and thus are subject to local 
minima. 

In a DAMN “actuation-space” arbitration scheme used previously, each 
behaviour votes for or against various alternatives in the actuator command space. 
For example, a turn arbiter receives votes for turn commands from two 
behaviours (Figure 2a & b), computes a normalised weighted sum (Figure 2c), 
and the summed votes are smoothed and interpolated (Figure 2d); the resulting 
command with the maximum vote value is then sent to the vehicle controller [16]. 
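
The voting scheme just described can be illustrated with a small Python sketch. It is not the DAMN implementation: the function names, the three-point smoothing kernel, the curvature set, and the vote values are all assumptions made for illustration, and the interpolation step is omitted.

```python
def weighted_vote_sum(votes, weights):
    """Normalised weighted sum of per-behaviour votes over one command set."""
    total = sum(weights)
    n = len(votes[0])
    return [sum(w * v[i] for v, w in zip(votes, weights)) / total for i in range(n)]


def smooth(values, kernel=(0.25, 0.5, 0.25)):
    """Three-point smoothing; end values are repeated at the borders (assumed)."""
    padded = [values[0]] + list(values) + [values[-1]]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(values))]


def select_turn(votes, weights, curvatures):
    """Return the curvature whose smoothed, weighted vote is largest."""
    combined = smooth(weighted_vote_sum(votes, weights))
    best = max(range(len(combined)), key=combined.__getitem__)
    return curvatures[best]


curvatures = [-0.10, -0.05, 0.0, 0.05, 0.10]   # hypothetical command set (1/m)
behaviour_1 = [0.0, 0.2, 0.6, 1.0, 0.4]        # hypothetical votes, weight 0.8
behaviour_2 = [0.1, 0.5, 1.0, 0.5, 0.1]        # hypothetical votes, weight 0.2
print(select_turn([behaviour_1, behaviour_2], [0.8, 0.2], curvatures))
```

Each behaviour contributes only a vote per command, so whatever reasoning produced those votes is not available to the arbiter.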




[Figure 2 panels: a) Behaviour 1, weight = 0.8; b) Behaviour 2, weight = 0.2; c) Weighted sum; d) Smoothed and interpolated] 



Fig. 2. Command fusion in DAMN actuation-space arbiter. 

This process is similar to fuzzy logic control systems, performing 
defuzzification using the maximum criterion, and in fact has been recast into a 
fuzzy logic framework [23]. Fuzzy logic has found many uses for mobile robot 
control, including command fusion systems (see [20] for a survey). Many use 
defuzzification strategies other than maximum, such as centre of mass, which 
assume a unimodal function, and in general this averaging of inputs will at times 
select inappropriate commands. In addition, these systems only deal with 
uncertainty implicitly. 

These command fusion schemes provide mechanisms for the concurrent 
satisfaction of multiple goals as determined by independent behaviours, thus 
allowing incremental, evolutionary system development. However, command 
fusion in general still has shortcomings which reduce the system’s overall 
effectiveness. A more complete overview of various types of architectures and 
their corresponding advantages and disadvantages can be found in [18]. 

3. Utility Fusion 

A new means of action selection via utility fusion is introduced as an alternative 
to centralised sensor fusion architectures, as well as to priority-based arbitration 
and command fusion in distributed systems. Instead of expressing action 
preferences, behaviours indicate the utility of various possible world states, 
together with stochastic estimates of sensor uncertainty. Utility theory provides a 
unified conceptual framework for defining votes and weights and dealing with 
uncertainty. Because we are attempting to decide which among a set of possible 
actions to take, it is natural to make judgements on the usefulness of each action 
based on its consequences. If we assign a utility measure U(c) for each possible 
consequence, and P(c|a,e) is the probability that consequence c will occur given 
that we observe evidence e and take action a, then the expected utility U(a) for a 
is given in equation 1: 

\[ U(a) = \sum_{c} U(c) \cdot P(c \mid a, e) \qquad (1) \]

In the context of mobile robot navigation, for example, actions may be vehicle 
manoeuvring or sensor positioning, positive consequences staying on a road or 
reaching a goal, negative consequences colliding with an obstacle, and the 
evidence would be observations obtained by processing sensory data. The 
conditional probabilities P(c|a,e) are determined by the uncertainty in the sensing 
process, in the vehicle position, and in vehicle control. The subjective utilities 
U(c) are determined for each behaviour based on the relative importance of the 
consequence; for example, a collision would have a large negative utility, while 
the utility of following a pre-planned path may be small. Provided with these 
utilities and probabilities by the behaviours, an arbiter can then apply the 
Maximum Expected Utility criterion to select an optimal action based on all 
current information [13]. 
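
As a small worked illustration of the Maximum Expected Utility criterion, the sketch below evaluates three candidate actions. The consequence names, utilities, and probabilities are invented for the example and do not come from the experiments reported here.

```python
def expected_utility(utilities, probabilities):
    """U(a) = sum over consequences c of U(c) * P(c | a, e)."""
    return sum(utilities[c] * p for c, p in probabilities.items())


def best_action(utilities, outcome_model):
    """Pick the action whose expected utility is largest."""
    return max(outcome_model,
               key=lambda a: expected_utility(utilities, outcome_model[a]))


# Hypothetical subjective utilities U(c)
utilities = {"collision": -100.0, "on_path": 5.0, "off_path": 0.0}

# Hypothetical P(c | a, e) for each candidate action a
outcome_model = {
    "hard_left": {"collision": 0.00, "on_path": 0.20, "off_path": 0.80},
    "straight":  {"collision": 0.10, "on_path": 0.90, "off_path": 0.00},
    "soft_left": {"collision": 0.01, "on_path": 0.60, "off_path": 0.39},
}

for action, probs in outcome_model.items():
    print(action, expected_utility(utilities, probs))
print(best_action(utilities, outcome_model))
```

Here the action most likely to stay on the path is not selected, because its small collision probability is multiplied by a large negative utility.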

Utility fusion does not create a world model as sensor fusion systems do. The 
information combined and stored by the utility fusion arbiter does not represent 
sensed features of the world, as in certainty grids [10] for example, but rather the 
desirability of being in a particular state according to some criterion defined by 
the behaviour. The processing of sensory data is still distributed among 
behaviours, so the bottlenecks and brittleness associated with sensor fusion are 
avoided. 

Unlike command arbitration or command fusion systems, the utility fusion 
arbiter does not select among or combine actions proposed by behaviours. 
Instead, the arbiter is provided with much richer evaluation information from 
behaviours, thus allowing for more complex decision-making. The arbiter 
accumulates utility and probability evaluations from the behaviours and bases its 
action selection on the combined evidence, so that the limitations of command 
fusion systems may be overcome. 

Virtual sensors provide processed sensor data and uncertainty estimates to the 
behaviours. Uncertainty in the locations of objects, as well as in the position of 
the vehicle, can be generated by various means such as the covariance matrix of a 
Kalman filter [2] or the residual of a linear regression algorithm. Each behaviour 
then uses the output of its associated virtual sensors to determine the utility and 
probability of various world states, and passes these to the utility arbiter, which 
combines and maintains a map of this information and evaluates candidate 
actions within it. The action which maximises expected utility is then sent as a 
command to the controller. 

For example, a utility arbiter has been developed for vehicle steering. Figure 
3a shows an area of positive utility associated with a road and an area of negative 
utility associated with a detected obstacle, with the lighter polygons suggesting 
the reduced probabilities as distance increases. The arbiter evaluates the possible 
trajectories, shown in the figure as arcs emanating from the vehicle, by summing 
the expected utilities along them, and selects the one for which the total is 
greatest. 




Fig. 3. Utility arbiter; a) evaluating arc trajectories - dark areas indicate high probability; 
b) evaluating clothoid-arcs originating from predicted vehicle position (lighter vehicle). 



3.1. Advantages of Utility Fusion 

Representation of Uncertainty. Uncertainties are often accounted for in an ad 
hoc manner, for example by “growing” the size of observed obstacles or by 
arbitrarily “fuzzifying” the inputs to a system to determine an approximately 
appropriate output. Similarly, potential fields implicitly deal with uncertainty 
with the field emanating from a point. Although the fuzzy behaviour blending 
described in [19] is formally defined as a logic of graded preferences [15], which 
is closely related to utilities, there is no objective measure and treatment of 
uncertainty based on probabilities. 

Utility theory provides a Bayesian framework for reasoning about uncertainty, 
teasing apart the value of the consequence of an action from the probability it will 
occur [2]. Each behaviour votes for the subjective utility of being in the particular 
states of concern to that behaviour, e.g., obstacle or road locations, along with 
associated uncertainties expressed as covariances in a multi-dimensional normal 
distribution. The arbiter then applies the Maximum Expected Utility criterion to 
select an action that is Pareto-optimal [14]. Estimates of the uncertainties in the 
vehicle’s position and control should also be taken into account when calculating 
expected utility; however, this has not yet been implemented. By casting the 
voting scheme within the framework of utility theory, uncertainty within the 
system is explicitly represented and reasoned about in the decision-making 
process. 

System Model. Another advantage of map-based utility fusion over command 
fusion is that the dynamics of the system being controlled can be fully modelled 
and accounted for by the arbiter, providing greater control accuracy and stability. 
For example, a vehicle cannot change instantly from one turn to another, but 
follows a linear change in curvature, resulting in a clothoid path rather than a 
simple arc [7]. Another important difference between commanded and actual 
vehicle trajectories is due to system latencies. To maintain stability, the system 
must be able to anticipate latencies and determine which actions are 
achievable [8]. The evaluation of clothoid-arcs from the predicted vehicle state is 
illustrated in Figure 3b. The utility arbiter performs lookahead to consider the 
consequences of its actions in terms of predicted future states, but looks ahead 
only one step to determine the next immediate action, on the assumption that the 
large uncertainties in perception and action in a domain such as outdoor mobile 
robots render further search futile. Thus, the system produces rational action 
while preserving responsiveness. 

The utility arbiter can take all of this information into account and evaluate 
actions within a single interchangeable module; the behaviours simply express the 
utility of various world states independent of which actions would cause the robot 
to enter that state. This provides greater modularity and interchangeability of 
behaviours; for example, a behaviour developed for a vehicle with Ackermann 
steering could be reused as is for an omnidirectional robot. 

Synchronisation. Allowing the behaviours in a distributed architecture to 
operate asynchronously, each at their greatest possible rate, maximises 
throughput and therefore reactivity. However, without synchronisation, 
behaviour outputs are based on different system states, so that the semantics of 
combining votes from different behaviours is ill-defined and may yield 
unpredictable results. Command fusion involves combining behaviour outputs 
which are only valid for a brief interval, but the utility arbiter receives votes for 
external world states whose meaning is well-defined independent of the current 
vehicle state. The use of a map allows vote combination to occur within the 
arbiter without imposing timing constraints on the behaviours, while still 
maintaining a consistent interpretation of votes received at different times and 
from different locations, thus allowing reasoning to be coordinated and therefore 
coherent. 

3.2. Implementation of Utility Arbiter 

The utility arbiter for vehicle steering control maintains a local map of the 
utilities and associated uncertainties sent to it by the behaviours. The utilities may 
be represented by geometric constructs (points, lines, and polygons) with 
associated two-dimensional Gaussian distributions, or by a grid; behaviours may 
use whichever is more appropriate; the arbiter maintains the two representations 
independently and combines their utility estimates at the final stages. 

For the sake of efficiency, the command choices to be evaluated are 
discretized; the results are smoothed and interpolated to avoid problems with 
bang-bang control, i.e., switching between discrete choices to achieve an 
intermediate value. 

Initialise Vehicle Trajectories: Given N possible steering commands, create an 
N×N matrix A, with each element $A_{ij}$ containing a trajectory corresponding to 
commanding curvature $\kappa_i$ while the vehicle is currently executing an arc of 
curvature $\kappa_j$. 

Collect Behaviour Utilities: Collect utilities, tagged by the sending behaviour 
with the position of the vehicle at the time the utilities were generated. 

Predict Vehicle State: Get current vehicle state from controller: position 
$(x_v, y_v, \theta_v)$, speed $v_v$, and curvature $\kappa_v$. Compute predicted state based on estimated 
latency. 
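
One simple way to picture this prediction step is to advance the pose along an arc for the duration of the latency. The sketch below assumes the curvature stays constant over that interval, which is a simplification of the clothoid behaviour discussed earlier; the function and parameter names are hypothetical.

```python
import math


def predict_state(x, y, theta, speed, curvature, latency):
    """Advance a pose along a constant-curvature arc for `latency` seconds."""
    s = speed * latency                     # distance covered during the latency
    if abs(curvature) < 1e-9:               # straight-line limit
        return x + s * math.cos(theta), y + s * math.sin(theta), theta
    new_theta = theta + curvature * s
    x_new = x + (math.sin(new_theta) - math.sin(theta)) / curvature
    y_new = y - (math.cos(new_theta) - math.cos(theta)) / curvature
    return x_new, y_new, new_theta


# Vehicle at the origin, 2 m/s, driving straight, 0.5 s latency
print(predict_state(0.0, 0.0, 0.0, 2.0, 0.0, 0.5))
```

Evaluating candidate trajectories from the predicted pose rather than the last reported pose is what lets the arbiter compensate for latency.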

Transform Utility Coordinates: For each point, line, or polygon utility, 
transform the coordinates to the predicted vehicle reference frame. Let these 
coordinates be $(x_u', y_u')$. If any of these lies more than three standard deviations 
behind the vehicle, i.e., $(y_u' - y_v') > 3\sigma$ for vertices, remove that utility object 
from the map. 
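
The coordinate transformation and pruning test can be sketched as follows. The axis convention (first returned coordinate along the vehicle heading), the function names, and the use of a single scalar standard deviation are assumptions for illustration only.

```python
import math


def to_vehicle_frame(px, py, vx, vy, vtheta):
    """Express world point (px, py) in the frame of a vehicle at (vx, vy)
    with heading vtheta. Returns (forward, lateral) under the assumed convention."""
    dx, dy = px - vx, py - vy
    c, s = math.cos(vtheta), math.sin(vtheta)
    return c * dx + s * dy, -s * dx + c * dy


def is_behind(forward_coords, sigma, k=3.0):
    """True if every vertex lies more than k standard deviations behind the vehicle."""
    return all(f < -k * sigma for f in forward_coords)


# A point 5 m behind a vehicle heading along +x, with sigma = 1 m
forward, lateral = to_vehicle_frame(-5.0, 0.0, 0.0, 0.0, 0.0)
print(forward, lateral, is_behind([forward], sigma=1.0))
```

Pruning objects well behind the vehicle keeps the local map small without discarding anything that could still affect a forward trajectory.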

Compute Expected Utilities: For each utility u and each point $(x_n', y_n')$ along a 
candidate vehicle trajectory, determine the transformed coordinates $(x_u^{\prime *}, y_u^{\prime *})$ of 
the point that is closest in Mahalanobis distance to $(x_n', y_n')$. Grid utilities are 
indexed by the vehicle position; zero is used if it is outside the bounds of the 
array. 

The contribution $E(n', u)$ of utility u to the expected utility of point $n' = (x_n', y_n')$ 
is the product of the utility value v and the probability as determined by 
the Mahalanobis distance between those two points in a bi-normal distribution: 

[equation illegible in source] 
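
To make the role of the Mahalanobis distance concrete, the sketch below weights a utility value by an unnormalised two-dimensional Gaussian of that distance. The unnormalised form (weight 1 at zero distance) is an assumption, as are the function names; the exact expression used by the arbiter is not given here.

```python
import math


def mahalanobis_sq(dx, dy, sxx, sxy, syy):
    """Squared Mahalanobis distance of offset (dx, dy) under the covariance
    matrix [[sxx, sxy], [sxy, syy]]."""
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def contribution(value, dx, dy, sxx, sxy, syy):
    """Utility value scaled by exp(-d^2 / 2), d being the Mahalanobis distance."""
    return value * math.exp(-0.5 * mahalanobis_sq(dx, dy, sxx, sxy, syy))


# Obstacle utility of -10 with standard deviations of 2 m in x and 1 m in y
print(contribution(-10.0, 0.0, 0.0, 4.0, 0.0, 1.0))   # at the obstacle
print(contribution(-10.0, 2.0, 0.0, 4.0, 0.0, 1.0))   # one sigma away in x
```

Using the covariance in the distance means that a point two metres away along a poorly localised axis counts as close as a point one metre away along a well localised one.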

Sum Expected Utilities Along Trajectories: The total expected utility for taking 
action a, i.e., following arc $A_{ij}$, is then computed by summing the utilities at each 
of N points along the arc, multiplied by a discount factor $\lambda$ ($0 < \lambda < 1$) used to 
account for the diminished expected returns of future actions, as in a POMDP: 

\[ U(a) = \sum_{n=1}^{N} \lambda^{n} \sum_{u} E(n', u) \]
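
A discounted sum along a trajectory might look like the following sketch. The exact form of the discounting is an assumption: here the n-th point along the arc (counting from 1) is weighted by the discount factor raised to the power n, and the function name is hypothetical.

```python
def trajectory_utility(point_utilities, discount=0.9):
    """Discounted sum of per-point expected utilities along one candidate arc.
    Points are ordered from nearest to farthest from the vehicle."""
    return sum(discount ** n * u
               for n, u in enumerate(point_utilities, start=1))


# Two arcs with the same utilities in a different order
near_obstacle = [-5.0, 0.0, 0.0, 1.0]
far_obstacle = [1.0, 0.0, 0.0, -5.0]
print(trajectory_utility(near_obstacle), trajectory_utility(far_obstacle))
```

With a discount below one, a negative utility close to the vehicle weighs more heavily than the same utility farther along the arc.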

Maximise Expected Utility: Determine the maximum expected utility $U(\hat{a})$ such 
that no other expected utility $U(a)$ has a greater value: $U(\hat{a}) \mid \forall a,\ U(\hat{a}) \ge U(a)$. 

Interpolate Commanded Curvature: Locally approximate the Gaussian 
distribution of votes around the selected action $\hat{a}$ by fitting a parabola to 
$U(\hat{a})$ and the utility values for the two adjacent commands, $U(\hat{a}-1)$ and $U(\hat{a}+1)$. 
The peak of this parabola is then chosen as the commanded action. 
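
The parabolic interpolation can be written in closed form when the three commands are equally spaced. The sketch below makes that spacing assumption; the function name and argument order are hypothetical.

```python
def parabolic_peak(k_best, step, u_prev, u_best, u_next):
    """Curvature at the vertex of the parabola through three equally spaced
    samples: (k_best - step, u_prev), (k_best, u_best), (k_best + step, u_next)."""
    denom = u_prev - 2.0 * u_best + u_next
    if denom >= 0.0:            # flat or not a maximum: keep the discrete choice
        return k_best
    offset = 0.5 * (u_prev - u_next) / denom
    return k_best + offset * step


# Samples of u(k) = -(k - 0.3)^2 at k = -1, 0, 1; the true peak is at k = 0.3
print(parabolic_peak(0.0, 1.0, -1.69, -0.09, -0.49))
```

Interpolating between discrete commands gives an intermediate curvature directly, rather than having the vehicle alternate between two neighbouring choices.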

Send Controller Command: Send the steering command as determined in the 
previous step to the controller, and add it to a queue for use in feedforward 
prediction. 

Repeat Loop: Return to Collect Behaviour Utilities step. 

3.3. Utility-Based Behaviours 

Within the framework of DAMN, behaviours must be defined to provide the 
task-specific knowledge for the domain. Each behaviour runs completely 
independently and asynchronously, using whichever sensor data and processing 
algorithms are best suited to the task at hand. Because behaviours operate 
asynchronously, each is allowed to perform as much or as little search as 
appropriate; for example, a reactive behaviour may perform no search while a 
deliberative behaviour may search all the way until the final goal and then vote 
for the next action in its constructed plan. DAMN has been used to construct 
several systems using many different behaviours [17]; some examples are 
provided below. 

Obstacle Avoidance Behaviours: The Avoid Obstacles behaviour uses the 
SMARTY obstacle detection system [9], which processes laser range or stereo 
vision images to determine intraversable regions of terrain. For each obstacle 
detected, the behaviour reports to the arbiter a large negative utility with standard 
deviations representing the sensor uncertainty, as well as another negative utility 
with a smaller value and larger standard deviation due to the problems associated 
with getting too close to an obstacle, i.e., constrained mobility, occlusion of 
unknown areas, etc. The obstacle can be represented as either a point or as a 
polygon. 

Goal Seeking Behaviours: While more reactive behaviours operate at a high rate 
to ensure safety and to provide functions such as road following and cross- 
country navigation, deliberative behaviours can process map-based or symbolic 
information at a slower rate, periodically issuing votes to the arbiter that guide the 
robot towards the current goal. These “high level” behaviours do not hand a plan 
down to a lower level for execution, but rather maintain an internal representation 
that allows them to participate directly in the control of the vehicle based on its 
current state [12]. 

For example, the Follow Subgoals behaviour associates a positive point utility 
with each of the subgoals to be reached by the vehicle. A line utility between 
subgoals is also defined so that a corridor is effectively created between 
consecutive goals to draw the vehicle back to the path when it strays, e.g., after 
avoiding an obstacle. The behaviour does not need to decide when a goal has 
been achieved or if it should be abandoned; each utility attracts the vehicle in turn 
as it gets closer. 

More sophisticated map-based planning techniques which determine an 
optimal global path have also been used. For example, the Follow Path behaviour 
uses the D* planner [22], which creates a grid representing how to reach the goal 
from any location in the map, determined by using A*. The behaviour then sends 
a portion of its map around the current vehicle position to the arbiter. 

4. Experimental Results 

In this section, we present some results from experiments conducted on the 
Navlab II HMMWV outdoor mobile robot at Carnegie Mellon University (and in a 
simulation of that vehicle). Figure 4 shows the vehicle in the environment in which 
the vehicle experiments took place; the terrain in the test area included many 
natural features such as hills, rocks, and ditches. 

In order to demonstrate the advantages of utility 
fusion over command fusion, experiments were run comparing the utility arbiter 
against the “actuation-space” arbiter described briefly in Section 2, which 
accepted votes from the behaviours for and against various turn commands. Trials 
with the utility arbiter’s predictive control capability turned off were also 
conducted so its effect could be observed independently. For experiments at low 
speeds, all arbiters successfully achieved their mission, both in actual vehicle 
experiments and in simulation. However, the effects of latency and dynamics 
became very apparent at higher vehicle speeds, and the utility arbiter with 
predictive control performed much better than the other arbiters. 

4.1. Performance Metrics 

Mean Obstacle Proximity. An important metric is the average distance from 
obstacles along a vehicle path. The distance to the closest obstacle provides a 
measure of safety clearance; when inverted, it provides a measure of proximity 
which is to be minimised. The mean obstacle proximity metric for a path is 
defined by the inverse square of the distance to the closest obstacle, integrated 
along the path and normalised by the total number of path points. A lower mean 
obstacle proximity means that the vehicle was on the average further away from 
the nearest obstacle, and therefore that the path was safer. 
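
As an illustration, the metric can be computed over a sampled path as follows. This is a minimal sketch, not the code used in the experiments: it assumes the path is a list of (x, y) samples, that obstacles are points, and that no sample coincides with an obstacle. The function name and data layout are hypothetical.

```python
import math

def mean_obstacle_proximity(path, obstacles):
    """Mean of 1/d^2 over the path samples, where d is the distance
    from a sample to its nearest obstacle (assumed to be non-zero)."""
    total = 0.0
    for px, py in path:
        # Distance from this path sample to the closest obstacle point.
        d = min(math.hypot(px - ox, py - oy) for ox, oy in obstacles)
        total += 1.0 / (d * d)
    # Normalise by the number of path points.
    return total / len(path)
```

A path that stays further from obstacles gives a smaller value, which matches the reading of the metric given above.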

Roughness. Roughness is defined by the square of the change in vehicle 
curvature with respect to time, integrated along the path and normalised by the 
total time. A lower measure means that curvature either changed less or more 
gradually along the path, and therefore the vehicle path was smoother [7]. 
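
The roughness measure can be approximated from logged data in the same way. The sketch below assumes time-stamped curvature samples and approximates the derivative by finite differences; the function name and the discretisation are assumptions made for illustration.

```python
def roughness(times, curvatures):
    """Integral of (d curvature / dt)^2 over time, divided by total time.
    Uses finite differences between consecutive samples."""
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        dk = curvatures[i] - curvatures[i - 1]
        # Squared rate of change of curvature, weighted by the interval length.
        total += (dk / dt) ** 2 * dt
    return total / (times[-1] - times[0])
```

A path whose curvature never changes has a roughness of zero under this definition.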

4.2. Utility Fusion vs. Command Fusion 

In order to demonstrate the benefits of utility fusion in compensating for vehicle 
dynamics, a vehicle simulator was used for experiments where conditions could 
be carefully controlled and higher speeds could be used without risk of damage. 
The path arbiter, with and without predictive control, in conjunction with the 
Obstacle Avoidance and Follow Subgoals behaviours (described above), was 
compared at various vehicle speeds and system latencies against the command 
fusion style turn arbiter and equivalent behaviours used in previous systems 
[9][16]. 

Fig. 4. Vehicle in test area. 

The graphs of mean obstacle proximity as a function of speed in Figure 5a and 
of path roughness vs. speed in Figure 5b show that the turn arbiter does very 
badly at higher speeds; these runs are shown in Figure 5c. The graphs also show 
that, at higher speeds, the utility arbiter without predictive control performed even 
worse than the turn arbiter, possibly due to the utility arbiter’s greater complexity; 
these runs are shown in Figure 5d. However, when the utility arbiter made use of 
its predictive control capabilities, it was still able to go through this narrow 
corridor and reach the goal, in spite of the fact that a delay of 2 seconds at a speed 
of 6 m/s meant that the vehicle travelled 12m between the time that a command 
was issued and the time that it would actually be executed. These successful path 
traces are shown in Figure 5e, along with the trace of the position of the vehicle 
as predicted by the arbiter, which coincided well with the actual path. 
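
To make the effect of latency concrete, the following sketch predicts where a vehicle will be once a delayed command takes effect. It assumes constant speed and constant curvature over the delay, which is a generic dead-reckoning model and not necessarily the model used by the arbiter; all names are hypothetical.

```python
import math

def predict_pose(x, y, heading, speed, curvature, latency):
    """Pose reached after `latency` seconds on a constant-curvature arc."""
    distance = speed * latency
    if abs(curvature) < 1e-9:
        # Straight-line motion.
        return (x + distance * math.cos(heading),
                y + distance * math.sin(heading),
                heading)
    new_heading = heading + distance * curvature
    return (x + (math.sin(new_heading) - math.sin(heading)) / curvature,
            y + (math.cos(heading) - math.cos(new_heading)) / curvature,
            new_heading)
```

With a speed of 6 m/s, a delay of 2 seconds and zero curvature, the predicted position is 12 m ahead of the current one, which is the distance quoted above.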

Fig. 5. (a-b) Path metrics as a function of speed: a) mean obstacle proximity, b) roughness; 
(c-e) Paths at high speeds with large lag: c) turn arbiter, d) utility arbiter without prediction, 
and e) utility arbiter with prediction. 



4.3. Effect of Predictive Control in Utility Arbiter 

The following tests were performed on the vehicle and environment shown above 
in Figure 4, operating at speeds of roughly 0.8 m/s. The utility arbiter was used 
both with and without predictive control in order to study its effect in isolation. 

As can be seen in Figure 6, the vehicle oscillated quite a bit without predictive 
control, yielding a roughness measure of 4.3x10'^, in contrast to the much more 
stable path generated using predictive control, with a roughness measure of 
3.0x10'^. The reduction in mean obstacle proximity, from 0.41 to 0.14, while not 
as dramatic, still represents a significant improvement in vehicle control afforded 
by utility fusion. 

Fig. 6. Vehicle curvature vs. time, with and without prediction capabilities 



5. Conclusion 

Because reactivity is essential for any system operating in a dynamic, uncertain 
environment, it is necessary to avoid the sensing and planning bottlenecks of 
centralised systems, but if we are to avoid sensor fusion, the system must 
combine command inputs to determine an appropriate course of action. However, 
priority-based arbitration only allows one module to affect control at any given 
time. Command fusion provides a mechanism for the concurrent satisfaction of 
multiple goals and allows modules to be completely independent, thus allowing 
evolutionary system development. However, existing command fusion techniques 
deal with uncertainty in an ad hoc manner, and they do not take system 
constraints into consideration when deciding upon a proper course of action. 

Utility fusion is introduced as a solution to the shortcomings of command 
fusion and sensor fusion systems. Instead of voting for actions, distributed, 
asynchronous behaviours indicate the utility of various possible world states and 
their probabilities based on domain-specific knowledge. The arbiter then 
evaluates various candidate actions, using system models to determine which 
actions can be taken without violating kinematic and dynamic constraints, and to 
provide greater stability. It then selects a Pareto-optimal action based on the 
maximisation of expected utility, thus providing a unified conceptual framework 
for defining the semantics of votes and for dealing with uncertainty. This new 
approach strikes a balance between action selection and sensor fusion and has 
been found to yield many benefits. 

For example, a utility arbiter has been implemented for vehicle steering 
control. Behaviours indicate the relative desirability of various possible vehicle 
locations, and the arbiter maintains a local map of these utilities. The arbiter then 
evaluates candidate actions by summing the expected utilities along each 
trajectory, taking uncertainty into account. The arbiter then chooses that trajectory 
which maximises expected utility and sends that steering command to the vehicle 
controller. 
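
The selection step can be sketched as follows. This is a simplified illustration under stated assumptions: the utility map is a dictionary from grid cells to utilities, candidate actions are constant-curvature arcs starting at the origin and heading along +x, and uncertainty and vehicle dynamics are ignored. None of these names or choices come from the implemented arbiter.

```python
import math

def arc_points(curvature, length, step):
    """Sample (x, y) points along a constant-curvature arc from the origin."""
    points = []
    for i in range(1, int(length / step) + 1):
        s = i * step
        if abs(curvature) < 1e-9:
            points.append((s, 0.0))
        else:
            points.append((math.sin(curvature * s) / curvature,
                           (1.0 - math.cos(curvature * s)) / curvature))
    return points

def choose_curvature(utility_map, curvatures, length=5.0, step=1.0, cell=1.0):
    """Return the candidate curvature whose arc collects the most utility."""
    def score(k):
        # Sum the utilities of the grid cells visited by the arc;
        # cells with no vote contribute zero.
        return sum(utility_map.get((round(x / cell), round(y / cell)), 0.0)
                   for x, y in arc_points(k, length, step))
    return max(curvatures, key=score)
```

For example, with a strongly negative utility straight ahead and a mildly positive one to the left, the left-turning arc is chosen.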

The utility space is not time-dependent, so that an arbiter using such a 
representation is capable of effectively synchronising and maintaining a 
consistent interpretation of the votes received from asynchronous behaviours, 
thus providing coherent reasoning in a distributed system. Behaviours can 
function without knowledge of the system dynamics, thus increasing their 
reusability for other systems. The utility arbiter can use models of the system 
being controlled to determine which states are actually attainable, and to increase 
the accuracy and stability of control. In particular, the map-based path arbiter 
gathers information from behaviours about the desirability of possible vehicle 
locations and then evaluates candidate trajectories to determine appropriate 
actions. The arbiter can then use kinematic models of the robot to determine 
which actions can be commanded without violating non-holonomic constraints, 
and use models of the system to provide greater stability. Thus, utility fusion provides 
coherent, optimal reasoning in a distributed, asynchronous system, combining the 
advantages of sensor fusion and command fusion while avoiding many of their 
drawbacks. It provides a well defined semantics of votes and uncertainty, and has 
been demonstrated experimentally to result in measurably better control. 

DAMN has been used to combine various systems of differing capabilities on 
several mobile robots, at various sites; in addition to its use on the CMU Navlab 
vehicles, DAMN has also been used at the Lockheed Martin Corporation, the 
Hughes Research Labs, and the Georgia Institute of Technology. DAMN arbiters 
have been used to integrate navigation modules for the steering and speed control 
of single as well as multiple vehicles at these sites, and have also been used to 
select field of regard for the control of a pair of stereo cameras on a pan/tilt 
platform. Vehicles under the control of DAMN have driven at highway speeds, 
navigated across stretches of off-road terrain some kilometres in length, 
cooperated with other robotic vehicles, and performed teleoperation, all while 
providing for the safety of the vehicle and meeting mission objectives. Current 
work at the University of Sydney involves behaviour coordination for 
autonomous underwater vehicles. 

Abstract. We often resort to graphical means in order to describe 
non-linear structures, such as task dependencies in project planning. There 
are many contexts, however, where graphical means of presentation are 
not appropriate, and delivery either via text or spoken language is to be 
preferred. In this work, we take some first steps towards the development 
of natural language generation techniques that seek the most appropriate 
means of expressing non-linear structures using the linear medium of 
language. 



1 Introduction 

Natural language generation, the use of natural language processing techniques 
to create textual or spoken output from some underlying non-linguistic 
information source, is an area of practical language technology that shows great 
potential. Various natural language generation (NLG) systems have been constructed 
which produce textual output from underlying data sources of varying kinds: for 
example, the FoG system [3] generates textual weather forecasts from numer- 
ical weather simulations; IDAS [5] produces online hypertext help messages for 
users of complex machinery, using information stored in a knowledge base that 
describes this machinery; ModelExplainer [4] generates textual descriptions 
of information in models of object-oriented software; and PEBA [7] interactively 
describes entities in a taxonomic knowledge base via the dynamic generation of 
hypertext documents, presented as World Wide Web pages. 

The present work represents the first steps in exploring how NLG techniques 
can be used to present the information in complex, non-linear data structures. 
In particular, we focus on project plans of the kind that might be constructed 
in an application such as Microsoft Project. These software tools make it easy 
to present the content of project plans via a number of graphical means, such 
as PERT and Gantt charts. However, they do not provide any capability for 
presenting the information in project plans via natural language. We pursue this 
possibility for two reasons. 

Firstly, we are interested in exploring the extent to which language can be 
used to express complex non-linear structures. We might hypothesise that lan- 
guage is not a good means for expressing this kind of information, since language 
requires us to linearise the presentation of the material to be expressed. 
However, some recent work in NLG has explored the use of mixed mode output, 
where graphics and text are combined; see, for example, [1]. A key question, 
then, is how best to apportion material between the two modalities. We intend 
to build on some of this recent research to see what kinds of information are best 
conveyed using language, and what elements are best conveyed graphically. We 
also aim to explore how sophisticated use of typography — indented structures, 
graphs containing textual annotations, and so on — can overcome some of the 
inherent limitations in purely ‘linear’ text. 

Secondly, we are interested in determining the extent to which the informa- 
tion in a project plan might be conveyed to a user via speech. Suppose a project 
manager is driving to a meeting, and needs a report on the current status of some 
project whose internal structure is complex. Assuming that we do not have so- 
phisticated heads-up displays or other similar presentation technologies, there 
is no possibility here that the information can be presented visually. In such a 
context, speech is the most plausible medium for information delivery, and so 
we are particularly interested in how the information available in a non-linear 
structure can most effectively be presented in a linear speech stream. Similarly, 
speech may be the delivery medium of choice for users who are vision-impaired. 

In this paper, we present some first steps towards achieving these goals. Our 
focus here is on a specific but particularly important sub-problem: how do we 
produce descriptions of parallel structures in such a way as to avoid ambiguity in 
interpretation? We have implemented a simple NLG system, PlanPresenter, 
that takes project plan information as input, and produces from this informa- 
tion a text that describes the dependencies between the project plan elements. 
Section 2 presents an overview of the system, describing the key components. Sec- 
tion 3 shows how PlanPresenter generates text from a simple input project 
plan, and Section 4 shows how a more complex example is dealt with using our 
intermediate level of representation to allow the required flexibility in the gen- 
eration process. Section 5 summarises the state of the work so far and sketches 
our next steps in this research. 

2 System Overview 

The system we describe here takes as its starting point earlier work described 
in [6], but departs from the system described there by adopting more recent ideas 
regarding the decomposition of the natural language generation process and the 
intermediate levels of representation that are required, as described in [8]. The 
input to our system is approximately equivalent to the information that can be 
extracted from a combination of the interchange formats provided by Microsoft 
Project, and which is likely to be available for any such project management tool. 
More particularly, we assume that we will be provided with a set of constructs 
corresponding to the basic undecomposable tasks in the plan — we will call these 
ATOMIC ACTIONS — and a set of dependency links that indicate which tasks must 
be completed before other tasks. 

Fig. 1. A project plan for going to see a movie 



It is likely that most project management systems will also make available 
information about other aspects of a project plan, such as the resources allocated 
to specific tasks, and hard constraints over temporal attributes such as start 
and end dates and task durations. In some contexts, we may also have access to 
information regarding the hierarchical relationships between plan components. 
We intend to make use of these elements of data as our work on project plan 
description proceeds; at this early stage, however, our primary concern is the 
linearisation of the information present in this essentially non-linear, networked 
structure. 

The PlanPresenter system consists of three principal components: 

- an INFORMATION STRUCTURER, which reconstructs the given plan informa- 
tion in a form more suited to textual description; 

- a DESCRIPTION STRUCTURER, which assigns specific structural categories to 
all components of the plan; and 

- a SURFACE REALISER, which works out how to express the content of the 
description structure linguistically. 

The result is a set of English instructions for performing that plan, 
written in such a way as to avoid ambiguities of understanding when parallelism 
occurs in the plan. The system is implemented in Prolog. 

3 A Simple Worked Example 

In this section we present a simple worked example that shows how 
PlanPresenter generates a description of a project plan from a symbolic 
representation of that plan. 



3.1 The Project Plan 

Figure 1 shows a PERT chart that indicates the relationships between a number 
of tasks within a larger project. In the example here, each task includes a start 
date, an end date and a duration; we will not make use of these for the moment, 
restricting ourselves to the standard elements of the temporal dependencies be- 
tween the tasks. 

action(a1, [find, people1]).       % Find interested people. 
action(a2, [choose, movie1]).      % Decide what movie to see. 
action(a3, [buy, tickets1]).       % Buy the tickets. 
action(a4, [arrange, place1]).     % Arrange the meeting place. 
action(a5, [meet, place1]).        % Meet at arranged place. 
action(a6, [enter, cinema1]).      % Go into the cinema. 

precedes(a1, a2).    precedes(a2, a3).    precedes(a2, a4). 
precedes(a3, a5).    precedes(a4, a5).    precedes(a5, a6). 

Fig. 2. The input representation corresponding to the project plan shown in 
Figure 1 



The project is for a group of people to go to see a movie at the cinema 
together. The plan consists of six atomic actions, labelled here a1 through a6: 
first, we have to find the group of people who are interested in going, then we 
have to decide which movie to see, then we have to buy the tickets and arrange 
where to meet, and then we have to meet and go into the cinema. Note in 
particular that the actions of buying the tickets and of arranging a place to 
meet beforehand can be carried out in any order, or even in parallel. 

The temporal dependencies here are indicated in the PERT chart by means of 
arrows. This information is presented to our system as a collection of symbolic 
constructs as shown in Figure 2. Here, for each action we have some additional 
information that will be used in describing this action: this is a pair of the form 
(ActionType, Entity), where the ActionType is drawn from an inventory of actions 
that the system knows how to express linguistically, and the Entity is a symbol 
that corresponds to some entity in the domain.¹ Given inputs of this kind, then, 
our goal is to generate a coherent text describing the plan in question. 



3.2 Producing an Output Text 

The present example is a very simple case of plan description; however, it allows 
us to demonstrate some of the essential elements of our method. 



¹ There are clearly issues of specific versus non-specific reference here which complicate 
matters; however, our present focus is on describing the overall structure of a plan, 
so we will sidestep for the moment many of the issues regarding the fine-grained 
modelling of the entities that participate in the plan. 

sequence([number_elements: 5], 
         [elements: [[1, atomic, a1], 
                     [2, atomic, a2], 
                     [3, simple_branch, [number_elements: 2], 
                      [elements: [[1, atomic, a3], 
                                  [2, atomic, a4]]]], 
                     [4, atomic, a5], 
                     [5, atomic, a6]]]). 

Fig. 3. Representing sequence and simple parallelism 

Building the Information Structure: First, we transform the given symbolic 
structures into a representation more suited to textual description. The key 
observation here is that language provides us with a variety of mechanisms for 
indicating both sequence and parallelism, so we re-express the input information 
in a form that highlights these relationships. Applying this process to the input 
data shown in Figure 2 results in the following structure:² 

sequence([a1, a2, parallel([a3, a4]), a5, a6]). 

It is easy to see how, in general terms, such a structure might be mapped directly 
into a text: 

- given a sequence of elements as in this case, we might simply express each 
element in the sequence by means of a sentence; 

- if an element in the sequence is a parallel structure, then we might indicate 
explicitly that all the actions in this structure can be carried out in parallel. 

Such a simple mapping mechanism will not, however, produce appropriate results 
in the case of more complex plans. In particular, if we have parallel structures 
that contain embedded parallelism or other complexities, then a direct mapping 
approach along the lines just sketched will result in unwieldy sentences. 
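
For the simple plan of Figure 2, the structure sequence([a1, a2, parallel([a3, a4]), a5, a6]) can be derived from the precedes links by grouping actions into layers whose predecessors have all been placed. The sketch below is written in Python rather than the Prolog of the implemented system, and it is an assumption about one way to do this, not a description of the information structurer itself. It handles only flat parallelism and would not recover the embedded structures discussed in Section 4.

```python
def layer_plan(actions, precedes):
    """Group actions into a sequence of layers; a layer with more than one
    action becomes a ("parallel", [...]) element."""
    preds = {a: set() for a in actions}
    for before, after in precedes:
        preds[after].add(before)
    done, structure = set(), []
    while len(done) < len(actions):
        # Actions whose predecessors have all been placed already.
        ready = sorted(a for a in actions
                       if a not in done and preds[a] <= done)
        if not ready:
            raise ValueError("cyclic dependencies in plan")
        structure.append(ready[0] if len(ready) == 1 else ("parallel", ready))
        done.update(ready)
    return ("sequence", structure)

# The plan of Figure 2.
ACTIONS = ["a1", "a2", "a3", "a4", "a5", "a6"]
PRECEDES = [("a1", "a2"), ("a2", "a3"), ("a2", "a4"),
            ("a3", "a5"), ("a4", "a5"), ("a5", "a6")]
```

Calling layer_plan(ACTIONS, PRECEDES) yields a sequence in which a3 and a4 form the only parallel element.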



Building the Initial Description Structure: In order to overcome this prob- 
lem, instead of mapping the plan structure directly into text, we construct an in- 
termediate representation which we call a DESCRIPTION STRUCTURE. This serves 
as an updateable repository for all the information we might need in making de- 
cisions as to how best to describe the plan. We can then perform reasoning 
operations over this structure to determine the best output, before committing 
ourselves to text. In the remainder of this section we show how the description 
structure is constructed and used in the present example; in Section 4 we show 
how this accommodates a more complex case of parallelism. 

Figure 3 shows the initial description structure for our plan. Notice that here 
we have made explicit a number of properties of our original plan structure: 

- We have explicitly indicated how many elements are present at each level in 
the plan structure. 

- We have explicitly numbered each constituent element. 

- We have explicitly indicated whether substructures in the plan are made up 
of atomic actions or are more complex in nature, as in the simple_branch 
element. 

² There exist plans whose structure does not readily map to the form described here. 
Consider for example Figure 7 with an arrow added from action 8 to action 5. 
Generating descriptions of plans such as these is a topic of future work. 

It is by virtue of this last step that our approach provides us with more so- 
phisticated control over the description process. In essence, we identify different 
kinds of structural patterns in plans, where these different patterns correspond 
to different mechanisms for description. Thus, a simple branching structure is 
one where the elements within the parallelism are themselves atomic actions. 
Such a structure is amenable to the direct-mapping form of description sug- 
gested informally above, but more complex structures will require the use of 
more sophisticated linguistic mechanisms. 
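
The three properties made explicit in the description structure (element counts, element numbering, and structural categories) can be computed by a single recursive pass over the information structure. The following Python sketch assumes the information structure is encoded as nested ("sequence", [...]) and ("parallel", [...]) tuples with action names as strings; this encoding and the function name are illustrative assumptions, and the implemented system works on Prolog terms.

```python
def describe(node):
    """Annotate an information structure with counts, numbering and
    structural categories, in the spirit of Figure 3."""
    if isinstance(node, str):
        return ("atomic", node)
    kind, children = node
    # Number each child from 1 and describe it recursively.
    described = [(i,) + describe(child)
                 for i, child in enumerate(children, start=1)]
    if kind == "parallel":
        # A branch is simple only if every element in it is an atomic action.
        simple = all(item[1] == "atomic" for item in described)
        kind = "simple_branch" if simple else "complex_branch"
    return (kind, {"number_elements": len(children)}, described)
```

Under this sketch, a parallel element containing only atomic actions is tagged simple_branch, and one containing sequences or further parallelism is tagged complex_branch.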

Determining Semantic Content: We now have to augment this description 
structure with additional information about the actions to be described. This is 
carried out as a sequence of two related processing steps. First, we incorporate 
information about how the actions themselves are to be described. The Action- 
Type in our input representation corresponds to the semantics of the predicate 
that will be used to describe that action, and the Entity serves as the index of 
the argument to the predicate. The next stage determines how the entities that 
participate in the plan will be referred to. For our present purposes, we do not 
make use of a sophisticated referring expression mechanism; essentially, we use 
simple table lookup to determine how a given entity should be described in a 
plan. At a later date we intend to incorporate more sophisticated algorithms 
for the generation of referring expressions along the lines described in [2]. The 
process of determining semantic content results in the output shown in Figure 4. 



Applying Structure Realisation Strategies: Once we have determined the 
relevant aspects of the description of each of the actions in the plan, we are in 
a position to decide how to realise the overall description structure. We do this 
by means of structure realisation strategies, which can be summarised 
in general terms as follows.³ 

- An action not immediately involved in a parallel description is described in 
a separate sentence, with appropriate adjuncts. 

- Actions involved in simple parallelism are combined in a single sentence. 

- Actions involved in more complex parallelism are described in terms of the 
groupings assigned by the information structurer, with each group in a sep- 
arate paragraph and signalled by appropriate adjuncts. See Section 4 for an 
example. Parallelism that is more embedded is signalled by means of the 
same strategy together with indentation. 



³ There are additional realisation strategies available, including textual ones such as 
numbered lists, and the use of multiple modalities. These topics are a subject of 
future work. 

sequence([number_elements: 5], 
         [elements: [[1, atomic, 
                      [index: a1, 
                       predicate: [sem: find], 
                       argument: [index: people1, 
                                  syn: [category: np], 
                                  text: [some, interested, people]]]], 
                     [2, atomic, 
                      [index: a2, 
                       predicate: [sem: choose], 
                       argument: [index: movie1, 
                                  syn: [category: np], 
                                  text: [a, movie]]]], 
                     [3, simple_branch, [number_elements: 2], 
                      [elements: [[1, atomic, 
                                   [index: a3, 
                                    predicate: [sem: buy], 
                                    argument: [index: tickets1, 
                                               syn: [category: np], 
                                               text: [the, tickets]]]], 
                                  [2, atomic, 
                                   [index: a4, 
                                    predicate: [sem: arrange], 
                                    argument: [index: place1, 
                                               syn: [category: np], 
                                               text: [a, meeting, place]]]]]]], 
                     [4, atomic, 
                      [index: a5, 
                       predicate: [sem: meet], 
                       argument: [index: place1, 
                                  syn: [category: np], 
                                  text: [the, meeting, place]]]], 
                     [5, atomic, 
                      [index: a6, 
                       predicate: [sem: enter], 
                       argument: [index: cinema1, 
                                  syn: [category: np], 
                                  text: [the, cinema]]]]]]) 

Fig. 4. Adding semantic content and referring expressions to the description 
structure 

sequence([number_elements: 5], 
         [elements: [[1, atomic, 
                      [index: a1, 
                       syn: [category: s, pre_adjunct: [first, ',']], 
                       predicate: [sem: find, 
                                   syn: [category: v, 
                                         vform: imperative]], 
                       argument: [index: people1, 
                                  syn: [category: np], 
                                  text: [some, interested, people]]]], 
                     [2, atomic, 
                      [index: a2, 
                       syn: [category: s, pre_adjunct: [then, ',']], 
                       predicate: [sem: choose, 
                                   syn: [category: v, 
                                         vform: imperative]], 
                       argument: [index: movie1, 
                                  syn: [category: np], 
                                  text: [a, movie]]]], 
                     [3, simple_branch, [number_elements: 2], 
                      [syn: [category: s], 
                       pre_adjunct: [then, ','], 
                       conjunct: [and], 
                       post_adj: [',', doing, these, in, any, order, you, like], 
                       [elements: [[1, atomic, 
                                    [index: a3, 
                                     syn: [category: s], 
                                     predicate: [sem: buy, 
                                                 syn: [category: v, 
                                                       vform: imperative]], 
                                     argument: [index: tickets1, 
                                                syn: [category: np], 
                                                text: [the, tickets]]]] 

                     [...] 

Fig. 5. The result of applying realisation strategies to the description structure 



[paragraph(1, [sentence([first, ',', find, some, interested, people]), 
               sentence([then, ',', choose, a, movie]), 
               sentence([then, ',', buy, the, tickets, and, arrange, 
                         a, meeting, place, ',', doing, these, in, 
                         any, order, you, like]), 
               sentence([then, ',', meet, at, the, meeting, place]), 
               sentence([finally, ',', go, into, the, cinema])])] 

Fig. 6. The final set of sentence specifications 

The results of this process are shown in Figure 5. Here, we can see that the 
structure realisation rules for atomic actions have determined various aspects of 
the sentential forms to be used. By taking account of the number of the elements 
in the sequence, appropriate adjuncts for first, then, and finally are added; an 
alternative realisation rule might decide to use the adjuncts second, third and so 
on instead of then. The realisation rules have also determined that the imperative 
forms of the verbs should be used. 
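
The choice of sequencing adjuncts depends only on the position of an element and the length of the sequence. A minimal Python sketch of such a rule is given below; the function name is hypothetical, and the treatment of a one-element sequence (no adjunct at all) is an assumption, since that case is not discussed here.

```python
def assign_adjuncts(number_elements):
    """Return one sequencing adjunct per element of a sequence:
    'first' for the first, 'finally' for the last, 'then' in between."""
    if number_elements == 1:
        return [None]  # assumption: a lone element needs no adjunct
    return ["first"] + ["then"] * (number_elements - 2) + ["finally"]
```

For the five-element sequence of the movie plan this gives first, then, then, then, finally, which is the pattern of adjuncts in the sentence specifications of Figure 6.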



Surface Realisation: Our description structure now contains enough 
information to be able to determine the final lexical content of our plan description. 
Information about the realisation of different verb forms is encoded in the sys- 
tem lexicon by means of entries like the following: 

lex([category: verb, sem: find, vform: imperative, lex: find]). 
lex([category: verb, sem: find, vform: progressive, lex: finding]). 

The result of incorporating this information is a final specification for the text 
to be generated, as in Figure 6. These specifications are passed to a rendering 
module which, at present, simply uppercases the first character of the first word 
of each sentence and appends a full stop at the end of each sentence, and wraps 
the entire paragraph within appropriate HTML tags: 

<p> 

First, find some interested people. 

Then, choose a movie. 

Then, buy the tickets and arrange a meeting place, 
doing these in any order you like. 

Then, meet at the meeting place. 

Finally, go into the cinema. 

</p> 
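
The rendering step just described is simple enough to sketch in a few lines. The Python below is an illustration only (the implemented system is in Prolog): it assumes a sentence specification is a list of word tokens in which a comma appears as its own token, and the function names are hypothetical.

```python
def render_sentence(tokens):
    """Join tokens, attach commas to the preceding word, capitalise the
    first character and append a full stop."""
    text = ""
    for token in tokens:
        if token == "," or not text:
            text += token
        else:
            text += " " + token
    return text[0].upper() + text[1:] + "."

def render_paragraph(sentences):
    """Render each sentence on its own line inside HTML paragraph tags."""
    body = "\n".join(render_sentence(s) for s in sentences)
    return "<p>\n" + body + "\n</p>"
```

Applied to the first sentence specification of Figure 6, render_sentence produces "First, find some interested people."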

Clearly, some improvements to the overall fluency are possible here, in particular 
with regard to the use of appropriate forms of subsequent reference. However, the key 
element of the system’s behaviour we wish to focus on here is the use of the 
intermediate level of representation — the description structure — in enabling us 
to create textual realisations whose overall structure is coherent. In the next 
section we look at how this is used in a more complex example. 



4 Dealing with Embedded Parallelism 

4.1 The Project Plan 

Figure 7 shows a PERT chart of a section of a plan dealing with housework. 
This part of the plan deals with cleaning the kitchen and dining area. After the 
task of ensuring that one has the required equipment, the tasks involved follow 
two main parallel branches. On one branch, we have the task of washing the 
dishes, followed by cleaning the stove, followed by wiping the benches, followed 
by mopping the kitchen floor. On the other branch we start with picking up 
rubbish and dusting the furniture, related to each other in simple parallelism. 
These two tasks are followed by vacuuming the floor, which is followed by taking 
out the garbage. 

Fig. 7. A more complex plan fragment 

This plan provides an example of a structure we call EMBEDDED PARAL- 
LELISM: this occurs when the plan contains two or more collections of actions, 
where the ordering between these collections of actions does not matter, and 
where there is also parallelism within at least one of these collections of actions. 
The case shown here is also more complex than our first example above in that 
the two top-level parallel structures each contain more than one action in a 
sequence. 

This plan information is provided to PlanPresenter in a form similar to 
that shown for our earlier example. 



4.2 Producing an Output Text 

Given the above input, PlanPresenter produces the following output text: 

First, ensure you have the required equipment. You are now ready for 
two main parts of the cleaning, which may be done in any order, or 
alongside each other. 

The first part is as follows. First, pick up any rubbish and dust the 
furniture, doing these in any order you like. Then, vacuum the carpet. 
Finally, take out the garbage. 

The second part is as follows. First, wash the dishes. Then, clean the 
stove. Then, wipe the benches. Finally, mop the kitchen floor. 

Note that the parallel relationships that exist in the plan are preserved in the 
text. 

As before, in order to generate this text we first construct an information 
structure as follows: 

sequence([a1, parallel([sequence([parallel([a2, a3]), a4, a5]), 
                        sequence([a6, a7, a8, a9])])]). 

sequence([number_elements: 2], 
         [elements: [[1, atomic, a1], 
                     [2, complex_branch, [number_elements: 2], 
                      [elements: 
                        [[1, sequence, [number_elements: 3], 
                          [elements: [[1, simple_branch, [number_elements: 2], 
                                       [elements: [[1, atomic, a2], 
                                                   [2, atomic, a3]]]], 
                                      [2, atomic, a4], 
                                      [3, atomic, a5]]]], 
                         [2, sequence, [number_elements: 4], 
                          [elements: [[1, atomic, a6], 
                                      [2, atomic, a7], 
                                      [3, atomic, a8], 
                                      [4, atomic, a9]]]]]]]]]) 

Fig. 8. Representing sequence and embedded parallelism 

Figure 8 shows part of the description structure that is then constructed from 
this representation. Note that this structure differs from our previous example 
in that, in building the description structure, we have recognised the presence 
of a COMPLEX BRANCH. 

The realisation strategies then augment this structure with relevant syntactic 
and semantic information as before; in this case, the presence of the complex 
branch results in a realisation decision that the two parts of this branch should 
be realised by means of separate paragraphs, and that the entire text should be 
preceded by a sentence that indicates the overall structure of the plan. 

Once complete, this description structure is then passed on to the surface re- 
alisation component, which produces the output specification shown in Figure 9. 



5 Conclusions and Next Steps 

In this paper, we have presented an NLG system that addresses the problem of
generating English natural language descriptions of plans that contain non-linear
elements. As a first step in this exercise, we have focussed on the problem of how
to express parallelism at different levels of complexity. We have demonstrated
how the use of an intermediate representation that encodes information about
the overall structure of the plan can serve both as an updateable repository of
information regarding the text to be generated, and as a structure that supports
reasoning about the best ways to present that information. The resulting texts
present instructions for performing the input plans in a way that attempts
to remove potential ambiguities in the structures described.
[paragraph(1, [sentence([first, ensure, you, have, the, required, equipment]),
               sentence([you, are, now, ready, for, two, main, parts,
                         of, the, cleaning, which, may, be, done, in,
                         any, order, or, alongside, each, other])]),
 paragraph(2, [sentence([the, first, part, is, as, follows]),
               sentence([first, pick, up, any, rubbish, and, dust, the,
                         furniture, doing, these, in, any, order, you, like]),
               sentence([then, vacuum, the, carpet]),
               sentence([finally, take, out, the, garbage])]),
 paragraph(3, [sentence([the, second, part, is, as, follows]),
               sentence([first, wash, the, dishes]),
               sentence([then, clean, the, stove]),
               sentence([then, wipe, the, benches]),
               sentence([finally, mop, the, kitchen, floor])])]

Fig. 9. The final set of sentence specifications



So far we have only scratched the surface in this exploration of how to describe 
non-linear structures. There are three major directions in which we intend to 
extend the current work. 

First, in many cases, a plan can be described hierarchically in terms of a 
number of high-level actions, each of which can consist of other high-level actions 
or atomic actions. We aim to incorporate this hierarchical information into our 
descriptions. 

A second avenue of development concerns the means of expression that are 
available to PlanPresenter. So far we have only used simple typographic 
mechanisms, such as paragraph structuring, to indicate the underlying structure 
of the plan. We aim to extend the range of realisation strategies available to the 
system so that more sophisticated outputs can be achieved. 

Finally, so far we do not make use of a significant amount of other information 
regarding durations and resources that is available to us; we intend to incorporate 
this information to provide more complete descriptions of plans. 
Abstract. This paper introduces a robust, portable system for categorizing
unknown words. It is based on a multi-component architecture
where each component is responsible for identifying one class of unknown
words. The focus of this paper is the component that identifies spelling
errors. The misspelling identifier uses a decision tree architecture to
combine multiple types of evidence about the unknown word. The misspelling
identifier is evaluated using data from live closed captions, a genre
replete with a wide variety of unknown words.



1 Introduction 

In any real world use, a Natural Language Processing (NLP) system will
encounter words that it does not recognize, which we term ‘unknown words’.
Unknown words are problematic because an NLP system will perform well only if it
recognizes the words that it is meant to analyze or translate: the more words a
system does not recognize, the more the system’s performance will degrade. Even
when unknown words are infrequent, they can have a disproportionate effect on 
system quality. For example, Min [13] found that while only 0.6% of words in 300 
e-mails were misspelled, this led to 12% of the sentences having errors (discussed 
in Min and Wilson [14]). 

Words may be unknown for many reasons: the word may be a proper name, 
a misspelling, an abbreviation, a number, a morphological variant of a known 
word (e.g. bigness), or simply missing from the dictionary. The first step in dealing
with unknown words is to identify the class of the unknown word: whether it
is a misspelling, a proper name, an abbreviation, etc. Once this is known, the
proper action can be taken: misspellings can be corrected, abbreviations can
be expanded, and so on, as deemed necessary by the particular text processing
application. In this paper we introduce a system for categorizing unknown words. 
The system is based on a multi-component architecture where each component
is responsible for identifying one category of unknown words. The main focus 
of this paper is the component that identifies spelling errors. The misspelling 
identifier uses a decision tree architecture to combine multiple types of evidence 
about the unknown word. The misspelling identifier is evaluated using data from 
live closed captions - a genre replete with a wide variety of unknown words. 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 122-133, 1999. 

© Springer-Verlag Berlin Heidelberg 1999
This paper is organized as follows. In section 2 we outline the overall
architecture of the unknown word categorizer (UWC). The misspelling identifier is
introduced in section 3. Performance and evaluation issues are discussed in sec- 
tion 4. Section 5 compares the current system with relevant preceding research. 
Concluding comments can be found in section 6. 

2 The Unknown Word Categorizer (UWC) 

The goal of our research is to develop a system that automatically categorizes 
unknown words. According to our definition, an unknown word is a word that 
is not contained in the lexicon of an NLP system. As defined, ‘unknown-ness’ 
is a relative concept; a word that is known to one system may be unknown to 
another system. 

Our research is motivated by the problems that we have experienced in trans- 
lating live closed captions: live captions are produced under tight time con- 
straints and contain many unknown words. Typically, the caption transcriber 
has a five second window to transcribe the broadcast dialogue. Because of the 
live nature of the broadcast, there is no opportunity to post-edit the transcript 
in any way. 

As can be seen from Table 1, unknown words comprise about 2.3% of the
words in our corpus of closed captions from business news programs. On average
each caption line is 10.6 words (we translate captions line by line). Thus, on
average every fourth line contains an unknown word. Table 2 provides insight
into the distribution of unknown words in this corpus. The majority of unknown 
words in this sample are proper names. This is not surprising given the subject 
domain. However, there is also a large proportion of misspellings. This is primar- 
ily due to the constraints under which these captions are produced. As noted 
above, live captions are produced under strict time constraints and there is no 
time for post-editing. The data also contains examples of abbreviations, foreign 
words, and words missing from the lexicon. The thirty one numerical expres- 
sions that occur in this sample have been excluded from this list and from the 
calculations in Table 1 since our MT system includes a separate number parsing
component. 



Percentage of unknown words 2.33% 

Average line length 10.6 

Percentage of lines, on average, containing an unknown word 24.7% 

Table 1. Data on unknown words from 5000 word sample 
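
The "every fourth line" estimate follows from multiplying the two figures in the table above. A small worked check, using only the numbers reported in the table (the variable names are ours):

```python
unknown_rate = 0.0233     # fraction of words that are unknown
avg_line_length = 10.6    # average number of words per caption line

# Expected number of unknown words per caption line.
expected_per_line = unknown_rate * avg_line_length

# Equivalently, the average number of lines per unknown word.
lines_per_unknown = 1 / expected_per_line

print(f"{expected_per_line:.3f} unknown words per line")
print(f"one unknown word every {lines_per_unknown:.1f} lines")
```

The product is about 0.247, which matches the 24.7% reported in the table and corresponds to roughly one unknown word in every four lines.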



Although motivated by our specific requirements, the unknown word catego- 
rizer would benefit any NLP system that encounters unknown words of differing 
categories. Some immediately obvious domains where unknown words are fre- 
quent include e-mail messages, internet chat rooms, data typed in by call centre 
operators, etc. Any NLP system that needs to interpret, translate or text-mine 
Name                                          62
Misspelling                                   24
Abbreviation                                   4
Foreign word                                   2
Missing from lexicon/morphological variant    25

Table 2. Distribution of unknown words from 5000 word sample



these types of data will need to deal with the problems posed by the many types 
of unknown words. 

The environment in which we foresee using the unknown word categorizer 
imposes a further constraint: the system must be portable to different genres and 
languages, each of which may have different types and proportions of unknown 
words. To deal with these issues we propose a multi-component architecture 
where individual components specialize in identifying one particular type of un- 
known word. For example, the misspelling identifier will specialize in identifying 
misspellings, the abbreviation component will specialize in identifying abbrevia- 
tions, etc. Each component will return a confidence measure of the reliability of 
its prediction (cf. Elworthy [5]). The results from each component are evaluated
to determine the final category of the word. 
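
A minimal sketch of how such components might be combined is given below. The paper does not specify the combination rule, so choosing the category with the highest confidence is an assumption made here, and the `Prediction` type, the `categorize` function, and the two toy components are hypothetical names used only for illustration.

```python
from typing import Callable, NamedTuple


class Prediction(NamedTuple):
    category: str      # e.g. "misspelling", "name", "abbreviation"
    confidence: float  # the component's reliability estimate, in [0, 1]


Component = Callable[[str], Prediction]


def categorize(word: str, components: list[Component], default: str = "unknown") -> str:
    """Return the category proposed by the most confident component.

    Taking the maximum confidence is an assumed combination rule.
    """
    predictions = [component(word) for component in components]
    if not predictions:
        return default
    return max(predictions, key=lambda p: p.confidence).category


# Toy stand-ins for real components, for illustration only.
def toy_misspelling_identifier(word: str) -> Prediction:
    return Prediction("misspelling", 0.8 if "tt" in word else 0.2)


def toy_abbreviation_identifier(word: str) -> Prediction:
    return Prediction("abbreviation", 0.7 if len(word) <= 3 else 0.1)


components = [toy_misspelling_identifier, toy_abbreviation_identifier]
print(categorize("distincttion", components))
print(categorize("ibm", components))
```

Because each component is just a function from a word to a prediction, a component can be dropped or swapped without touching the others, which is the property the architecture relies on.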

There are several advantages to this approach. Firstly, the system can take 
advantage of existing research. For example, the name recognition module can 
make use of the considerable research that exists on name recognition (e.g. Mc- 
Donald [12], Mani et al. [11]). Secondly, this approach facilitates tuning for differ- 
ent domains. For example, some domains may have no unknown proper names. 
In these cases this component need not be included. For example, in their auto- 
matic spelling correction system integrated into an intelligent tutor for medical 
students, Elmi and Evens [4] assume that all unknown words are misspellings 
or abbreviations. In this case, all components except the misspelling and ab- 
breviation components can be removed. Thirdly, individual components can be 
replaced when improved models are available, without affecting other parts of 
the system. Fourthly, this approach is compatible with incorporating multiple 
components of the same type to improve performance (cf. Van Halteren et al. [7] 
who found that combining the results of several part of speech taggers increased 
performance). 

3 The Misspelling Identifier 

The main purpose of this paper is to introduce the misspelling identifier. The 
goal of the misspelling identifier is to differentiate between those unknown words 
which are spelling errors and those which are not. We define a misspelling as an 
unintended, orthographically incorrect representation (with respect to an NLP 
system) of a word. A misspelling differs from the intended known word through 
one or more additions, deletions, substitutions, or reversals of letters, or the 
exclusion of punctuation such as hyphenation or spacing. Table 3 contains several
examples of misspellings. ‘Distincttion’ differs from the known word ‘distinction’
by the addition of a letter. ‘Laidup’ differs from the known phrase ‘laid up’ by the 
deletion of a space. ‘Foss’ differs from the known word ‘force’ by the substitution 
and deletion of letters. Finally, the unknown word ‘clamor’ differs from the known 
word ‘clamour’ by the deletion of the vowel ‘u’. 



Misspelled unknown word    Correct spelling
distincttion               distinction
laidup                     laid up
foss                       force
clamor                     clamour

Table 3. Examples of Misspellings



Like the definition of ‘unknown word’, the definition of a misspelling is also 
relative to a particular NLP system. For example, a system that only includes 
British spellings like ‘clamour’ will not recognize American alternatives such as 
‘clamor’, and vice versa. We consider these variants to be misspellings (from the 
system’s perspective) since they differ from a known intended word by one or 
more substitutions, deletions, etc. Once identified as a misspelling such words 
can then be ‘corrected’ to the known spelling. 

In order to identify those unknown words which are misspellings, we utilize 
a binary decision tree to model the characteristics of misspellings. Decision trees 
are automatically induced from a series of cases. A range of variables (or features) 
are specified for each case. The resulting decision tree consists of a series of zero 
or more internal decision nodes and terminal leaves. Each node represents a 
decision point. In order to classify a new instance, one starts at the root of the 
tree and follows the decision nodes to a terminal leaf. 
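
The classification procedure can be sketched as a walk from the root to a leaf. The sketch below is not the Intelligent Miner implementation: the `Node` and `Leaf` classes, the feature names, and all thresholds and purity values in the toy tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str     # predicted class at this terminal leaf
    purity: float  # share of training cases at this leaf carrying that label


@dataclass
class Node:
    feature: str      # name of the feature tested at this decision point
    threshold: float  # go left if value <= threshold, otherwise right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], case: dict) -> Leaf:
    """Start at the root and follow decision nodes until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.left if case[node.feature] <= node.threshold else node.right
    return node


# A toy tree with made-up thresholds, for illustration only.
toy_tree = Node(
    "corpus_frequency", 2,
    left=Node("edit_distance", 1, Leaf("misspelling", 0.9), Leaf("other", 0.7)),
    right=Leaf("other", 0.8),
)

result = classify(toy_tree, {"corpus_frequency": 1, "edit_distance": 1})
print(result.label, result.purity)
```

A tree with zero internal nodes is simply a single `Leaf`, in which case `classify` returns it immediately.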

The advantage of decision trees is that they are highly explainable: one can 
readily understand the features that are affecting the analysis (Weiss and In- 
durkhya [17]). They are fast, and the purity of individual nodes provides a 
measure of reliability for individual predictions. Furthermore, decision trees are 
well-suited for combining a wide variety of information. Hence, they are
particularly suited to an exploratory investigation such as the one this paper
describes. For this project, we made use of the Decision Tree that is part of
IBM’s Intelligent Miner suite for data mining.

The features we use are intended to capture the characteristics of misspellings. 
These are predominantly derived from previous research. However, we also in- 
clude some more exploratory features. The features fall into two categories: (i) 
lexical features: characteristics of the unknown word, and (ii) contextual features: 
characteristics of the context in which the unknown word occurs. An abridged 
list of the features that are used in the training data is given in Table 4 and
discussed below. We exemplify the discussion using data from our training cor- 
pus: live closed captions from business news broadcasts. The effectiveness of the 
various features is discussed more fully in the section on evaluation. 
Characteristics of the Word     Characteristics of the Context
Corpus frequency                Distance to next unknown
Word length                     Part of speech
Edit distance                   Suggestion in context
Ispell information              Local frequency
Character sequence frequency    Name in vicinity
Non-English characters

Table 4. Features used in decision tree



Corpus frequency Vosse [16] differentiates between misspellings and neolo- 
gisms (new words) in terms of their frequency. His algorithm classifies unknown 
words that appear infrequently as misspellings, and those that appear more fre- 
quently as neologisms. Our corpus frequency variable specifies the frequency of 
each unknown word in a 2.6 million word corpus of business news closed captions. 



Word length Agirre et al. [1] note that their predictions for the correct spelling 
of misspelled words are more accurate for words longer than four characters, 
and much less accurate for shorter words. This observation can also be found 
in Kukich [10]. Our word length variable measures the number of characters in
each word. 



Edit Distance Edit-distance is a metric for identifying the orthographic
similarity of two words. Typically, one edit-distance corresponds to one substitution,
deletion, reversal or addition of a character. Damerau [3] observed that 80% of 
spelling errors in his data were just one edit-distance from the intended word. 
Similarly, Mitton [15] found that 70% of his data was within one edit-distance 
from the intended word. We implement this feature as follows. We use the unix 
ispell program to generate spelling suggestions for each unknown word. We then 
calculate three different distance measures for each suggestion and select the 
measures for the ispell suggestion that is most similar to the unknown word. 
The first distance measure is the simple edit-distance metric introduced above:
the number of substitutions, insertions, deletions, and reversals required to convert
from one word to the other. The other distance measures we use are the score and 
percentage returned by the ‘lalign’ program (Huang and Miller [8]). The lalign 
program was originally developed for aligning DNA sequences, but it is equally 
effective for aligning character sequences, and returns two scores representing 
their similarity. 
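
The simple edit-distance measure can be sketched as follows. This is a standard dynamic-programming formulation (the "optimal string alignment" variant, which counts a reversal of two adjacent characters as one edit); the paper does not say which implementation was used, so this is an assumption. The `closest_suggestion` helper, which stands in for selecting the most similar ispell suggestion, is likewise hypothetical and takes the suggestion list as a plain argument rather than calling ispell.

```python
from typing import Optional


def edit_distance(source: str, target: str) -> int:
    """Count substitutions, insertions, deletions and adjacent reversals."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # reversal
    return dist[-1][-1]


def closest_suggestion(unknown: str, suggestions: list) -> Optional[str]:
    """Pick the suggestion most similar to the unknown word, if any exist."""
    if not suggestions:
        return None
    return min(suggestions, key=lambda s: edit_distance(unknown, s))


for wrong, right in [("distincttion", "distinction"), ("laidup", "laid up"),
                     ("foss", "force"), ("clamor", "clamour")]:
    print(wrong, right, edit_distance(wrong, right))
```

On the example misspellings given earlier in this section, three of the four pairs are one edit apart, while ‘foss’ and ‘force’ are three edits apart.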



Ispell Information Unix ispell is a spell-checking program that returns a vari- 
ety of information about the words it checks. There are five different return codes 
that indicate the status of the word; correctly spelled, incorrectly spelled with 
suggestions, incorrectly spelled without suggestions, etc. A categorical variable 
captures this information. 
Character sequence frequency A characteristic of some misspellings is that
they contain character sequences which are not typical of the language, e.g. tlted,
wfiil. Exploiting this information is a standard way of identifying spelling errors 
when using a dictionary is not desired or appropriate. The best results reported 
are obtained using positional binary tri-grams (e.g. Hull and Srihari [9]). A
positional binary tri-gram array contains information about the existence/non- 
existence of a tri-gram abc occurring in position ijk. Recent work records the 
frequency of trigrams, rather than just (non-) existence. 

The disadvantage of positional binary tri-grams is that they require a large 
number of parameters. In this paper we use an alternate approach that requires 
fewer parameters. Before calculating character sequence frequency, we append 
an identifier, such as ‘#’, to the beginning and end of each word. These identifiers
are included in determining the character n-grams. For example, the word ‘house’
consists of the following tri-grams: #ho, hou, ous, use, se#. These tri-grams are
sensitive to character sequences at the beginning and end of a word but do not 
differentiate where a sequence occurs within a word. Hence, fewer parameters 
need to be stored. 
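
The boundary-marked tri-grams described above can be computed in a few lines. The function name `boundary_trigrams` is ours, and ‘#’ is used as the boundary identifier, as in the ‘house’ example.

```python
def boundary_trigrams(word: str, marker: str = "#") -> list:
    """Return the character tri-grams of a word padded with boundary markers."""
    padded = f"{marker}{word}{marker}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(boundary_trigrams("house"))
```

Only the first and last tri-grams carry positional information (through the marker); tri-grams in the interior of a word are position-independent, which is what keeps the number of parameters small.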

Having determined which tri-grams to consider, the next question that arises 
is what source to use to determine frequencies. Early work calculated character 
sequence frequencies from lexicons. However, Zamora [18] obtained better results
by using a corpus rather than a dictionary. Since our corpus is very noisy, i.e. it 
contains many unknown words, it is not clear whether this corpus would obtain 
the same results. For this reason we hedge our bets and obtain frequency data 
from a range of sources: the Oxford Advanced Learners Dictionary; a 2.6 million 
word corpus of business news closed captions; an 11 million word corpus of 
non-live closed captions (less noisy); and the approximately one million word
Lund-Oslo-Bergen corpus. The frequencies for each corpus are represented by 
separate features, two for each corpus. The two features contain the frequencies 
of the two lowest frequency tri-grams found in each word. 
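
As a rough sketch of how the two features per corpus might be derived, the code below counts boundary-marked tri-grams over a word list and returns the two lowest frequencies found in a given word. The function names and the toy word list are ours, and the handling of words with fewer than two tri-grams (repeating the single value, or using zero) is an assumption, since the paper does not discuss that case.

```python
from collections import Counter


def boundary_trigrams(word: str, marker: str = "#") -> list:
    padded = f"{marker}{word}{marker}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def count_trigrams(words) -> Counter:
    """Tri-gram frequencies over a corpus given as an iterable of word tokens."""
    counts = Counter()
    for word in words:
        counts.update(boundary_trigrams(word))
    return counts


def two_lowest_frequencies(word: str, counts: Counter) -> tuple:
    """Frequencies of the two rarest tri-grams in the word (unseen ones count 0)."""
    freqs = sorted(counts[t] for t in boundary_trigrams(word))
    if not freqs:
        return (0, 0)               # assumed fallback for an empty word
    if len(freqs) == 1:
        return (freqs[0], freqs[0])  # assumed fallback for one-letter words
    return (freqs[0], freqs[1])


# Toy corpus, for illustration only.
counts = count_trigrams(["house", "mouse", "houses"])
print(two_lowest_frequencies("house", counts))
print(two_lowest_frequencies("tlted", counts))
```

In the full feature set this computation would be repeated once per frequency source, giving two features for each of the four corpora.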

The features discussed to this point have been well-motivated by previous 
research. In the following we outline some of the more exploratory features which 
we have included in the current version of the misspelling identifier. 

Non-English Characters This binary feature specifies whether a word con- 
tains a character that is not typical of English words, such as accented characters, 
etc. 

Distance to next unknown This feature identifies the distance, in number of 
words, to the next unknown word. 

Part of Speech (POS) The tagset we used is a reduced version of the tags 
used in the Oxford Advanced Learners Dictionary. 

Suggestion in context As described above, the unix ispell program returns 
a list of possible spellings for the unknown word. This feature specifies whether 
any of these spellings occur in the nearby context. This feature is motivated
by the characteristics of live closed captions: captioners occasionally follow a 
misspelling with the correct form of the word. 



Local frequency This feature specifies whether the unknown word itself occurs 
in the local context. Local context in this case is defined as plus or minus twenty 
words. 
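
The two window-based context features can be sketched as below. The function names are ours. The twenty-word radius is the one stated for local frequency; reusing the same radius for ‘suggestion in context’ is an assumption, since the paper only says "nearby context". The suggestion list is passed in directly rather than obtained from ispell.

```python
def context_window(tokens: list, position: int, radius: int = 20) -> list:
    """Tokens within `radius` words on either side, excluding the word itself."""
    start = max(0, position - radius)
    return tokens[start:position] + tokens[position + 1:position + radius + 1]


def local_frequency(tokens: list, position: int, radius: int = 20) -> bool:
    """Does the unknown word itself occur again in the local context?"""
    return tokens[position] in context_window(tokens, position, radius)


def suggestion_in_context(tokens: list, position: int, suggestions: list,
                          radius: int = 20) -> bool:
    """Does any suggested spelling occur in the nearby context?"""
    window = set(context_window(tokens, position, radius))
    return any(suggestion in window for suggestion in suggestions)


tokens = "the markit fell as the market closed".split()
print(suggestion_in_context(tokens, 1, ["market", "marker"]))
print(local_frequency(tokens, 1))
```

In the toy caption line, the misspelling ‘markit’ is followed shortly by the correct form ‘market’, so the first feature fires while the second does not.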



Name in Vicinity We include two features identifying whether this word is a 
name or is adjacent to a word identified as a name. We make use of a simple 
proper name recognizer that we have available. This name recognizer cues off 
titles and other easy to determine information since captions do not include case 
information (the main information source for most proper name recognizers). 

All of the features that we have included have at least one thing in common; 
they are extremely portable. Each of them can be readily re-calculated for a 
new domain, or even a new language. Apart from a corpus of the new domain 
and language, the only other requirements are some means of generating spelling 
suggestions (ispell is available for many languages), some type of proper name 
recognizer, and optionally, a part of speech tagger (although we are not yet 
convinced of the usefulness of this feature). If more information sources are 
available, such as the results of morphological, syntactic, or semantic analysis, 
then these can readily be included in the decision tree training corpus. For our 
purposes, where portability is an important criterion, we avoid the use of such
application-specific information sources. 

4 Evaluation 

In this section we evaluate the misspelling identifier introduced above. 

The training data for the decision tree consists of 1350 cases of unknown 
words extracted from a 2.6 million word corpus of live business news captions. 
The relatively small size of the training data was partly motivated by the
portability requirement: for training and testing it is necessary to manually
identify the unknown words that are misspellings (all the other features can be
automatically generated). Hence, a system that requires only a small training set is
more portable than one that requires a large training set.

The data was split into ten training/test sets. Each test set contains ten
percent of the data, and the corresponding training set contains the remaining
ninety percent. Each test set covers a different ten percent of the data. The
results reported below are the average over the ten tests. Given the small size
of the test and training sets, this is a more reliable approach than simply
splitting the data into a single training set and test set.
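
The ten-way split can be sketched as follows. This is a generic ten-fold cross-validation split, not necessarily the exact procedure used: shuffling the cases before splitting, the fixed seed, and the function name `ten_fold_splits` are assumptions.

```python
import random


def ten_fold_splits(cases, folds: int = 10, seed: int = 0):
    """Yield (training, test) pairs; each case is tested exactly once."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # assumed: cases are shuffled first
    for k in range(folds):
        test = shuffled[k::folds]
        train = [case for i, case in enumerate(shuffled) if i % folds != k]
        yield train, test


# With 1350 cases, each test set has 135 cases and each training set 1215.
splits = list(ten_fold_splits(range(1350)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

The reported precision and recall would then be the mean of the ten per-fold values.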

The average precision and recall data for the ten tests are given in Table 5,
together with two base-line cases. The first base-line case assumes that we
categorize all unknown words as misspellings. The second base-line case assumes
that all unknown words are not misspellings. As can be seen, the current state
of the system is a significant improvement over the baseline cases. The baseline
case for predicting spelling errors is 39.6% precision. In contrast, the decision
tree approach obtains 71.8% precision and 71.2% recall.



                             Baseline Precision   Precision   Recall
Predicting Misspellings      39.6%                71.8%       71.2%
Predicting Non-misspellings  60.4%                81.2%       81.2%

Table 5. Precision and recall for baseline and initial decision tree
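
The baseline column of Table 5 follows directly from the class proportions: if every unknown word is labelled a misspelling, precision equals the share of misspellings in the data. The sketch below illustrates this with a standard precision/recall computation. The function name and the toy labels are ours; the 396/604 counts are made up solely to reproduce the 39.6% proportion and are not the paper's data.

```python
def precision_recall(gold: list, predicted: list, target: str = "misspelling") -> tuple:
    """Precision and recall for one target class."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    actual_pos = sum(1 for g in gold if g == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy gold labels with 39.6% misspellings (illustrative counts only).
gold = ["misspelling"] * 396 + ["other"] * 604

# Baseline: label every unknown word as a misspelling.
baseline = ["misspelling"] * len(gold)
precision, recall = precision_recall(gold, baseline)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The same baseline trivially reaches full recall, which is why precision is the informative figure for comparison with the decision tree.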



The above results assume that both types of possible errors are weighted 
equally. (The two types of errors are: classifying a misspelling as correct, and 
classifying a correct word as a misspelling.) However, since the misspelling iden- 
tifier is the only opportunity in the system for identifying spelling errors, the 
system should emphasize recall over precision. That is, it is more important 
that the misspelling identifier identify all spelling errors, at the cost of including 
some correctly spelled words. Since the misspelling identifier is embedded in a 
larger system where the results from all components are combined, there is the 
opportunity for those words that are incorrectly deemed to be spelling errors to 
be re-classified. 

Hence, we include results in Table 6 for a weighted tree which emphasizes 
recall over precision. The tree is weighted so that the cost of predicting that a 
misspelling is correct is six times the cost of predicting that a correct spelling 
is a misspelling. (The weighting factor of six was obtained empirically through
manual adjustment.) Using this approach increases
the recall to 91.4% with a loss of precision to 57.3%. In the remainder of this
section, we explore these results in more detail to determine which features are
most influential in identifying spelling errors and which cases are most resistant
to correct classification. 



                                         Precision   Recall
Predicting Misspellings: weighted by 6   57.3%       91.4%

Table 6. Precision and recall for weighted tree
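
One way to see why a 6:1 cost ratio raises recall is to look at the label that minimises expected cost at a leaf. The sketch below is a generic cost-sensitive decision rule, not a description of how Intelligent Miner applies the weights internally, which the paper does not state; the function name and parameter names are ours.

```python
def weighted_label(p_misspelling: float, miss_cost: float = 6.0,
                   false_alarm_cost: float = 1.0) -> str:
    """Choose the label with the lower expected cost.

    p_misspelling    -- estimated share of misspellings among cases at a leaf
    miss_cost        -- cost of calling a misspelling correct
    false_alarm_cost -- cost of calling a correct word a misspelling
    """
    cost_if_other = miss_cost * p_misspelling
    cost_if_misspelling = false_alarm_cost * (1.0 - p_misspelling)
    return "misspelling" if cost_if_misspelling <= cost_if_other else "other"


for p in (0.10, 0.20, 0.50):
    print(p, weighted_label(p), weighted_label(p, miss_cost=1.0))
```

With equal costs a leaf is labelled a misspelling only when at least half of its cases are misspellings; with a weight of six the threshold drops to one in seven, so more words are flagged, trading precision for recall.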



Evaluation of our feature set reveals that several features rely on the ispell
program to predict possible correct alternatives. Features which rely on this
information include all three edit distance measures and the ‘suggestion in
context’ feature. This suggests that we may do best in those cases where possible
spellings exist and less well in those cases where ispell could not provide
alternative suggestions. Indeed this is the case. If we include only those cases
where ispell can provide a suggestion, then our test results improve considerably
even though the training set is reduced by 35%. These results can be found in
Table 7. Unweighted results are given in the first entry. Recall has increased to
75.9% and precision has increased to 74.9%. Weighting these results also gives
significantly increased recall, as indicated by the second line of Table 7: 97.0%
recall on these examples can be achieved with 61.0% precision. (Again, the
weighting factor was determined empirically by manual adjustment.) This indicates
that we will have to focus more attention on developing features which are less
reliant on the information provided by ispell in order to increase the accuracy
of the decision tree on all the examples.



Precision Recall 

Predicting Misspellings 74.9% 75.9% 

Predicting Misspellings: weighted by 10 61.0% 97.0% 

Table 7. Precision and recall for reduced training data 



Analysis of the ten trees produced by the training process also provides inter- 
esting insight into the strengths and weaknesses of the current approach. In the 
following discussion we evaluate the decision trees produced by reduced train- 
ing/test data. Although each tree was trained on a small training corpus, there 
is considerable similarity between the trees. Three of the features are found in 
each of the trees. These are listed in Table 8, together with the node level at 
which they first appear in each of the ten trees. This table indicates that the first 
three levels of the decision trees are quite similar. Each tree uses the frequency 
of the unknown word in the corpus as the first feature on which to branch the 
tree. Similarly, each tree makes use of edit distance as one of the features on 
which to make the secondary split. Thirdly, the name feature is common to all 
trees at the third level, except for tree5 where it appears at the second level. 



Trees          tree1  tree2  tree3  tree4  tree5  tree6  tree7  tree8  tree9  tree10
Frequency      1      1      1      1      1      1      1      1      1      1
Edit Distance  2      2      2      2      2      2      2      2      2      2
Name           3      3      3      3      2      3      3      3      3      3

Table 8. Features common to all trees and the level at which they first occur
in each tree



All but three of the remaining features found in the trees are character
sequence features. Recall that we extracted character sequence frequencies from a range
of corpora. For each unknown word we extract the two lowest frequency character 
tri-grams found in the unknown word. No one frequency feature was definitive 
enough to be found in all of the trees. However, the fact that at least one such 
feature was found in every tree indicates that this type of information has pre- 
dictive value. One of our next steps will be to explore different ways of exploiting 
this information in a way that can be more consistently predictive. 

The three remaining features which occur in the trees are word length, lalign 
percent, and the non-English feature. Table 9 indicates the relative usefulness of 






A Decision Tree-Based Misspelling Identifier 131 



these features. Word length was used in nine out of the ten trees and hence is 
a valuable feature. In contrast, lalign percent and the feature identifying words 
with non-English characters occur only one or two times. The infrequency of the 
lalign percent feature and the total absence of the lalign score feature are no 
doubt due to the superior predictive value of the simple edit-distance feature. 
As we noted in Table 8, edit-distance occurs uniformly in every tree at the 
second branching level. This indicates that the simpler edit distance measure 
may be sufficient for the purposes of detecting misspellings, and that the more 
sophisticated approaches, as represented by the lalign features, are essentially 
redundant. 



Trees tree1 tree2 tree3 tree4 tree5 tree6 tree7 tree8 tree9 tree10

Word length 5 4 4 3 4 7 3 6 4

Lalign percent 4 5

Non-English 3

Table 9. Remaining features and the level at which they first occur
in each tree



This discussion provides excellent insight into productive directions for future 
research. Our first step will be to experiment with the character sequence features 
in order to identify features that will lead to more stable trees, i.e. we need to 
find one or two such features that can be found across all trees rather than the 
diverse range of character sequence features that we have now which all appear 
inconsistently. Secondly, our best results have been with misspellings for which 
ispell can provide alternate suggestions. We need to evaluate means by which we 
can improve our accuracy on the remaining unknown words. Thirdly, we plan to 
increase the size of the training corpus to determine the possible benefits to be 
gained from a larger training and test set. 

Further, one of our primary goals has been to develop a system that could 
rapidly be tuned for a new domain or language. An immediate goal is to tune 
the system for the domain of e-mail messages and evaluate its performance. 
Like the domain of closed captions, e-mail messages also contain many unknown
words. However, unlike closed captions, e-mail messages usually contain case 
information, thus making the differentiation between misspellings and proper 
names an easier task. 



5 Related Research 

There is little research that has focused on differentiating misspellings from other 
types of unknown words. For example, research on spelling error detection and
correction for the most part assumes that all unknown words are misspellings
and makes no attempt to identify other types of unknown words (e.g. Elmi and
Evens [4]). Naturally, these are not appropriate comparisons for the work re- 
ported here. However, as is evident from the discussion above, previous spelling 





132 Janine Toole



research does play an important role in suggesting productive features to
include in the decision tree. 

Research that is more similar in goal to that outlined in this paper is 
Vosse [16]. Vosse uses a simple algorithm to identify three classes of unknown 
words; misspellings, neologisms, and names. Capitalization is his sole means of 
identifying names. However, capitalization information is not available in closed 
captions. Hence, his system would be ineffective on the closed caption domain 
with which we are working. Granger [6] uses expectations generated by scripts to 
analyze unknown words. The drawback of his system is that it lacks portability 
since it makes use of scripts. 

Research that is similar in technique to that reported here is Baluja et al. [2]. 
Baluja and his colleagues use a decision tree classifier to identify proper names 
in text. Their motivation for using this approach reflects our own: decision tree 
classifiers are effective at combining different types of information. 



6 Conclusion 

In this paper we have introduced the misspelling identifier component of the un- 
known word categorizer. The purpose of the misspelling identifier is to identify 
those unknown words that are misspellings. Because of the requirements of the 
unknown word categorizer in which this component is embedded, the spelling 
identifier must be readily tunable to different domains and languages. To this 
end, we introduced a decision tree-based system which combines multiple types 
of information about the unknown word. The types of features used to char- 
acterize misspellings are readily recalculated for new domains and languages. 
The system provides encouraging results when evaluated against a particularly 
challenging domain: transcripts from live closed captions. Evaluation of the re- 
sults indicates several productive directions for future work. Further, although 
this system has been motivated by the demands of the closed caption domain in 
which we are working, the unknown word categorizer will be useful in any do- 
main that contains different types of unknown words. Relevant domains which 
we identified include e-mail, internet chat, and call-centre data. 
Abstract. This paper describes Sync/Trans, an incremental spoken lan- 
guage translation system. The system is being developed for ef- 
ficiently translating a spontaneous speech dialogue between an English 
speaker and a Japanese speaker. Since its purpose is to behave as a simulta- 
neous interpreter, the system produces the target output synchronously 
with the source input. Sync/Trans has the following features: (1) the 
system consists of modules that work in a synchronous fashion, (2) the 
system translates the source language possibly word-by-word accord- 
ing to the appearance order, (3) the system utilizes grammatically ill- 
formed expressions for the speech output, and (4) the system corrects the 
grammatical ill-formedness of the speech input at an early stage. 
An experimental system for translating English speech into Japanese 
speech has been implemented. A few experimental results have shown 
Sync/Trans to be a promising system for simultaneous interpretation. 



1 Introduction 

Immediate speech comprehension and production are essential to smooth in- 
teraction in a spoken dialogue. Interpretation systems that mediate a sponta- 
neous spoken dialogue must therefore also participate in the dialogue 
without disrupting its coherence. 

Our intuitions suggest that efficient speech dialogue translation strongly re- 
quires simultaneous interpretation, which is one of the ambitious applications 
in artificial intelligence [10]. As an example, let us consider a dialogue between 
an English speaker and a Japanese speaker through an English-Japanese inter- 
pretation system and a Japanese-English interpretation system. Figure 1 shows 
a comparison with the dialogue using a conventional machine translation system. 
Since the conventional system cannot start the translation processing until the 
input of an entire Japanese/English sentence finishes, the waiting time ^ of the 

^ This means the time from the end of an utterance by the English/Japanese speaker to 
the start of the utterance by the Japanese-English/English-Japanese interpretation 
system. 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 134-143, 1999. 
[Figure 1: timelines for the English speaker, E-J interpreter, Japanese speaker 
and J-E interpreter. Panel (b): Cross-conversation using a simultaneous 
interpretation system.] 

Fig. 1. Comparison between a dialogue through conventional systems and a 
dialogue through simultaneous interpretation systems 



English/Japanese speaker is long. Therefore, the time required for one cycle ^ 
necessarily becomes long. On the other hand, since the simultaneous interpreta- 
tion system can start translating right after the start of the input, the waiting 
time is reduced. As a result, the time of one cycle also becomes shorter. 

The following enumerates several problems that should be solved in the de- 
velopment of a simultaneous interpretation system. 

- Architecture A machine translation system is usually composed of mod- 
ules such as parsing, transfer and generation, which work sequentially in a 
compositional way [11]. That is to say, each module cannot start pro- 
cessing until the previous module finishes processing an entire sentence. It 
is practically impossible for such a system to behave as a simultaneous 
interpreter. 

- Incrementality Most current techniques for parsing, transfer and gen- 
eration process natural language on a sentence-by-sentence basis. Each 
module should be able to build up partial representations for incomplete 
input [4]. 

^ This means the time from the start of an utterance to the start of the next utterance 
by the English/Japanese speaker. 

- Difference in word-order To a greater or lesser degree, the word-order of a 
source language is different from that of the target language. The difference 
might cause the system to lose the simultaneity of the output with 
the input. Most of the simultaneous interpretation systems that have been 
proposed so far (e.g., [6, 7, 1, 2]) have not solved this problem. 

- Grammatical Ill-formedness Grammatically ill-formed expressions ap- 
pear very frequently in spontaneous speech. It is necessary to investigate a 
method of incrementally translating the ill-formed input. 

This paper describes Sync/Trans, a system for incremental spoken language 
translation between English and Japanese. The aim of this study is to pursue the 
possibility of a simultaneous interpretation and to realize a speech dialogue on it 
as Figure 1(b) shows. The characteristic features of Sync/Trans corresponding 
to the above four problems respectively are as follows: 

- Synchronous performance: The system is composed mainly of two modules, 
parsing and transfer, which work almost synchronously with the input. In 
other words, each module starts processing right after the start of 
processing in the previous module. 

- Incremental speech translation: The system produces a translation of the 
spoken source language on a possibly word-by-word basis, following the 
order in which the words appear [3]. Each module can predict the processing 
result partway through the input. The next module can use that result 
at any time. 

- Utilization of grammatically ill-formed expressions: The system utilizes 
grammatically ill-formed expressions that characterize spoken language in the 
translation results [8,9]. This is a key to the success of translating between 
English and Japanese, which differ in word-order, in an exceedingly 
incremental way. 

- Correction of grammatical ill-formedness at an early stage: On finding 
grammatical ill-formedness in an input sentence, the system corrects it im- 
mediately. This enables the system to proceed robustly with the translation 
processing. 

This paper reports on the evaluation of Sync/Trans through a few translation 
experiments. However, we will concentrate on the evaluation of the English-to- 
Japanese translation part of Sync/Trans, because the current implementation is 
restricted to the translation of English into Japanese. 

2 English-Japanese Translation 

2.1 Overview 

We have developed a system for translating English speech into Japanese speech 
incrementally. Figure 2 shows the configuration of the system. The system is 
Fig. 2. Configuration of the English-Japanese part of Sync/Trans 



composed of eight components: speech recognition, speech synthesis, incremen- 
tal parsing, incremental transfer, parsing rules, transfer rules, a lexicon and a 
chart. The chart is the data structure that represents the possible 
phrase structures of the source language partway through the input. By executing 
incremental parsing and transfer in sequence for each word, 
the system can translate a spoken English sentence synchronously as its 
words appear. 



2.2 Production of Grammatically Ill-formed Sentences 

The expressions such as repetitions, inversions, ellipses, repairs and hesitations 
are grammatically ill-formed but natural in Japanese daily conversations. In or- 
der to incrementally and synchronously translate between English and Japanese 
which are different in word-order, the system utilizes these expressions in an 
effective way. 

For example, although the standard Japanese translation of a simple English 
sentence (2.1) is (2.2), the system generates (2.3) synchronously with the input 
of (2.1). 

(2.1) Ken met her in the park yesterday. 

(2.2) ken-wa (Ken) kinoo (yesterday) koen-de (in the park) kanojo-ni (her) 
atta (met). 

(2.3) ken-wa (Ken) atta (met) kanojo-no, anoo, kanojo-ni (her) koen-de (in 
the park) kinoo (yesterday) atta (met). 
Input (2.1)    Output (2.2)              Output (2.3) 
Ken            ken                       ken 
met            wa                        wa atta 
her                                      kanojo-no 
in                                       anoo 
the                                      kanojo-ni 
park                                     koen-de 
yesterday      kinoo                     kinoo 
               koen-de kanojo-ni atta    atta 

Fig. 3. Timing of the output of (2.2) and (2.3) 



We can say that Japanese people can understand (2.3) easily in spite of the 
grammatical ill-formedness. 

Figure 3 shows the timing of the output of (2.2) and (2.3). It is obvious that 
the system can output (2.3) synchronously with (2.1). 

2.3 Chart-Based Framework 

To represent the incomplete structures obtained during incremental parsing, we have 
adopted a chart-based framework in both parsing and transfer [9]. 

Incremental chart parsing produces edges labeled with a term whose category 
is a sentence. The edges represent the possible structures of the entire source 
language input up to that point in time. It differs from the orthodox 
bottom-up chart parsing method [5] in introducing two operations: applying a parsing 
rule to an active edge, and replacing the leftmost undecided term in an active edge 
with the term labeling another active edge. The incremental transfer, on the other hand, 
produces the Japanese expressions by applying transfer 
rules to an edge in a top-down fashion. How the system utilizes grammatically 
ill-formed expressions is described in the transfer rules. 

2.4 Translation of Grammatically Ill-formed Sentences 

The system can translate grammatically ill-formed source sentences incremen- 
tally by correcting the error immediately after parsing fails. By correctly con- 
structing the structure for the well-formed part of the ill-formed sentence, the 
system can reproduce it to some extent in the translation result. 

For example, the English sentence (2.4) can be regarded as one in which the 
word “going” in the phrase “going by train” is omitted. 

(2.4) I think by train is best. 

The system inserts a category, e.g. gerund, immediately after “by” is input. 
Japanese speakers can correctly understand the semantic content of (2.5), 
which is the translation result of (2.4). 







Table 1. Translation result of 278 sentences 

type                      sentences   rate (%) 
A) correct (no repair)           96       34.5 
B) correct (repairs)            132       47.5 
C) unnatural                     33       11.9 
D) incorrect                     16        5.7 
E) failed                         1        0.4 



(2.5) watashi-wa (I) omoimasu (think) densya-de-ga (by train) ichiban-ii (is 
best) to-omoimasu (think). 

3 Evaluation 

An experimental system has been implemented in GNU Common Lisp 2.2 on 
a workstation. We have run a few experiments with the dialogues in the ATR 
Dialogue Database, whose task domain is travel. 

3.1 Basic Experiment 

To evaluate the effectiveness of the system, we first ran a 
translation experiment using 4 dialogues. The dialogues consist of 278 spoken 
English sentences, the average length of which is 6.8 words. The system was 
implemented with 476 English words and 204 grammar rules. In order 
to evaluate only the real-time processing of the system, the English input was 
restricted to grammatically well-formed sentences. To satisfy this requirement, 
we excluded extra-grammatical phenomena such as hesitations and errors 
from the source sentences in advance. 

The success rate was examined. As Table 1 shows, we have classified the 
source sentences according to the translation results. The 228 sentences classified into 
(A) or (B) are translated correctly, giving a success rate of 82.0%. The result 
shows the system to be usable for spoken language translation. Although the 
successful Japanese sentences produced by the system differ from those of 
a conventional system in that they include much ill-formedness, they 
represent the semantic content of the source sentences correctly. 



3.2 Translation Processing Unit 

A major cause of translation failure in the above experiment is that 
repairs appear too frequently in the translation results (33 sentences classified 
into (C), accounting for 11.9%). In particular, the longer the source sentence be- 
comes, the more repairs appear in the translation. To address this problem, 
we can consider relaxing the word-by-word restriction. 
Table 2. Translation results on a delay of one word 

type                      sentences   rate (%) 
A) correct (no repair)          122       43.9 
B) correct (repairs)            113       40.6 
C) unnatural                     27        9.7 
D) incorrect                     15        5.4 
E) failed                         1        0.4 



Table 3. Translation results of 278 sentences without error correcting 

type                      sentences   rate (%) 
A) correct (no repair)           21        7.6 
B) correct (repairs)            106       38.1 
C) unnatural                      9        3.2 
D) incorrect                     45       16.2 
E) failed                        97       34.9 



We ran an experiment on a version of the system that translates with a delay 
of one word, using the same 4 dialogues. Table 2 shows the success 
rate. 235 sentences (accounting for 84.5%) are translated correctly. Whereas the 
frequency of repairs averaged 1.07 per sentence in the above experiment, 
the one-word delay reduces the frequency to 0.74. In general, there exists a 
trade-off between the translation unit and the translation accuracy. This result 
shows that it is important to pursue an effective translation unit for simultaneous 
interpretation. 
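
The figures quoted for the word-by-word and one-word-delay experiments can be rechecked with a few lines of arithmetic. The sketch below only recomputes values from the counts printed in Tables 1 and 2 and the repair frequencies stated in the text; the variable and function names are illustrative, and the relative reduction in repairs is a derived figure that the paper itself does not report. 

```python
# Counts transcribed from Table 1 (word-by-word) and Table 2 (one-word delay).
TOTAL = 278
table1 = {"A": 96, "B": 132, "C": 33, "D": 16, "E": 1}
table2 = {"A": 122, "B": 113, "C": 27, "D": 15, "E": 1}


def success(table):
    """Return (count, rate in %) of sentences in classes A and B."""
    correct = table["A"] + table["B"]
    return correct, round(100 * correct / TOTAL, 1)


base_count, base_rate = success(table1)
delay_count, delay_rate = success(table2)

# Average repairs per sentence, as stated in the text.
repairs_base, repairs_delay = 1.07, 0.74
# Derived figure (not in the paper): relative drop in repair frequency.
relative_drop = round(100 * (repairs_base - repairs_delay) / repairs_base, 1)

print(base_count, base_rate)    # word-by-word success
print(delay_count, delay_rate)  # one-word-delay success
print(relative_drop)            # relative reduction in repairs, in %
```

Under these counts the one-word delay turns 7 more sentences into correct translations while cutting the repair frequency by roughly a third. 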



3.3 Translation of Grammatically Ill-formed Sentences 

In Section 3.1, we have made the experiment on the assumption that all input 
is well-formed. However, such an assumption is not realistic for spontaneously 
spoken language translation. 

We ran an experiment with the same 4 dialogues, consisting of 181 well- 
formed sentences and 97 ill-formed sentences, on the system with 391 
English words and 94 grammar rules. Tables 3 and 4 show the results of the 
translation without error correcting and with error correcting. As a result of 
introducing error correcting, 27 sentences, or 9.7%, are newly 
added to the correct translation results. This result shows the error correcting 
method to be effective for spontaneous speech translation. 
Table 4. Translation results of 278 sentences with error correcting 

type                      sentences   rate (%) 
A) correct (no repair)           21        7.6 
B) correct (repairs)            133       47.8 
C) unnatural                     15        5.4 
D) incorrect                    100       36.0 
E) failed                         9        3.2 
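
The effect of error correcting can be read off by comparing the two tables class by class. The following sketch, with illustrative names, recomputes the gain of 27 sentences (9.7%) from the printed counts and also lists the net change in each class; the per-class changes are net differences only, since the tables do not say which individual sentences moved between classes. 

```python
# Counts transcribed from Table 3 (without) and Table 4 (with error correcting).
TOTAL = 278
without_correction = {"A": 21, "B": 106, "C": 9, "D": 45, "E": 97}
with_correction = {"A": 21, "B": 133, "C": 15, "D": 100, "E": 9}


def correct(table):
    """Number of correctly translated sentences (classes A and B)."""
    return table["A"] + table["B"]


gain = correct(with_correction) - correct(without_correction)
gain_rate = round(100 * gain / TOTAL, 1)

# Net change per class; these must sum to zero because the total is fixed.
net_change = {k: with_correction[k] - without_correction[k]
              for k in without_correction}

print(gain, gain_rate)
print(net_change)
```

Read this way, error correcting removes 88 outright failures, but only 27 of them end up as correct translations in net terms; most of the remainder appears as incorrect output. 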



4 Japanese-English Translation 

This section describes the Japanese-English translation part of Sync/Trans 
briefly. There exist some difficulties which should be overcome in developing 
a system of incremental Japanese-English translation. One of the difficulties is 
as follows: 

- Although the system should output the English verb at an early stage, 
the verb usually appears at the end of a Japanese sentence. 

As an example, let us consider translating the following Japanese sentence into 
English incrementally. 

(4.1) kinoo (yesterday) ken-wa (Ken) mado-wo (the window) hanma-de (with 
a hammer) watta (broke). 

The standard translation of (4.1) is (4.2). 

(4.2) Ken broke the window with a hammer yesterday. 

It is impossible in principle to output the English verb phrase “broke the window 
with a hammer” before the Japanese verb “watta” appears. 

To overcome the difficulty, we are investigating how to predict an English verb at 
an early stage. In a dialogue task that is restricted to some extent, it might be possible 
to predict the upcoming verb from the other noun phrases in the sentence. If 
the verb “broke” can be predicted from “Ken” and “the window”, the system 
can produce an English sentence (4.3) synchronously with (4.1). 

(4.3) Yesterday, Ken broke the window with a hammer. 

Figure 4 shows a comparison between the timing of the output of (4.2) and (4.3). 



5 Concluding Remarks 

This paper has described Sync/Trans, a speech-to-speech translation system 
which we have been studying. Sync/Trans has the following features: synchronous archi- 
tecture, incremental translation, utilization of grammatical ill-formedness and 
early error correcting. 
Input (4.1)    Output (4.2)                                Output (4.3) 
kinoo                                                      yesterday 
ken-wa         Ken                                         Ken 
mado-wo                                                    broke 
hanma-de                                                   the window 
watta          broke the window with a hammer yesterday    with a hammer 

Fig. 4. Timing of the output of (4.2) and (4.3) 



This paper has provided a few experimental results on the English-Japanese 
part which we have implemented as a first step towards simultaneous inter- 
pretation. From the results, we have found that incrementally translating the 
source language in an appropriate unit and correcting the errors in the source 
sentence at an early stage are effective for speech-to-speech translation. In the 
near future, as soon as the Japanese-English translation part is implemented, we 
are planning to evaluate Sync/Trans on a speech dialogue between an English 
speaker and a Japanese speaker. We might be able to confirm the effectiveness 
of Sync/Trans as a spontaneous dialogue interpreter through the evaluation. To 
this end, high-accuracy speech recognition and real-time language processing are 
essential. 
Abstract. This paper analyzes the notion of a minimal belief change 
that incorporates new information. I apply the fundamental decision- 
theoretic principle of Pareto-optimality to derive a notion of minimal 
belief change, for two different representations of belief: first, for be- 
liefs represented by a theory (a deductively closed set of sentences or 
propositions), and second, for beliefs represented by an axiomatic base 
for a theory. Three postulates exactly characterize Pareto-minimal revi- 
sions of theories, yielding a weaker set of constraints than the standard 
AGM postulates. The Levi identity characterizes Pareto-minimal revi- 
sions of belief bases: a change of belief base is Pareto-minimal if and only 
if the change satisfies the Levi identity (for “maxichoice” contraction op- 
erators). Thus for belief bases, Pareto-minimality imposes constraints 
that the AGM postulates do not. 

Keywords: belief revision, decision theory 



1 Minimal Theory Change 

New information changes our beliefs continually. How should we incorporate new 
assertions into a body of existing ones? This question arises in many situations of 
practical interest. For example, if the new assertion describes new data, incorpo- 
rating the evidence into current beliefs is an essential part of learning systems. 
If the new assertion is a datum presented to a database system, we face the 
question of how to update a database, and the same goes for knowledge bases. 

In the last two decades or so, the following principle has attracted much 
interest among computer scientists and logicians [3,8,6,10,2]: Revise your be- 
liefs so as to minimize the extent of change from the original beliefs. The aim 
of this paper is to analyze the notion of minimal belief change. I derive ax- 
ioms for minimal belief change from basic principles of decision theory. The 
same decision-theoretic principles lead to different results for different ways of 
formally representing beliefs. Specifically, I consider two such representations: 
Belief modeled as a deductively closed set of sentences (or propositions), and 
belief modeled by an axiomatic “belief base”. For each of these representations 
of belief, I consider the consequences of using the fundamental decision-theoretic 
principle of Pareto-Optimality to define minimal belief changes. 

Roughly, Pareto-minimal belief revisions are those that cannot be improved 
by adding fewer beliefs without giving up more, or by giving up fewer beliefs 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 144-155, 1999. 
without adding more. As it turns out, there is a purely set-theoretic definition 
of Pareto-minimal belief revisions in terms of the symmetric set differences be- 
tween the current theory and alternative revisions. The main theorem of this 
paper establishes that certain axioms for belief revision characterize Pareto- 
minimal theory changes, in the sense that a theory change is Pareto-minimal 
if and only if the change satisfies these axioms. The chief difference between 
Pareto-minimality and the standard AGM postulates [3] arises in the case in 
which the current theory entails neither the new information nor its negation. 
In that case, the AGM revision is the result of adding the new information to 
the current theory. Pareto-minimal revisions, however, may be logically weaker 
than the AGM revision.^ 

Pareto-optimality leads to different results for minimal revisions of belief 
bases, sets of sentences that need not contain all of their logical consequences. The 
well-known Levi identity characterizes Pareto-minimal changes of belief bases: I 
prove that they are exactly those that result from, first, retracting just enough 
basic beliefs to make the agent’s basic beliefs consistent with the new information 
(technically, a “maxichoice contraction” [3, Ch. 4.2]), and second, adding the 
new information to the basic beliefs contracted in this manner. Since AGM 
revisions may give up more beliefs than maxichoice contraction permits, this 
characterization shows that Pareto-minimality yields some constraints on the 
revision of belief bases that the AGM axioms do not require (cf. [1]). 

2 Theories 

Following much of the belief revision literature, I employ a syntactic representa- 
tion of an agent’s beliefs. However, all the developments to follow are valid for a 
semantic approach based on propositions (sets of models) as well. I assume that 
some language L has been fixed, and take a theory to be a deductively closed set 
of formulas from L. In Section 5 I consider belief sets that are not deductively 
closed. 

As is usual in belief revision theory, my assumptions about the structure 
of the language in which an agent formulates her beliefs are sparse; essentially, 
all I assume is that the language features the usual propositional connectives. 

I take as given a suitable consequence relation between sets of formulas in the 
language, obeying the standard Tarskian properties. The formal presuppositions 
are as follows. 

A language $L$ is a set of formulas satisfying the following conditions. (1) $L$ 
contains a negation operator $\neg$ such that if $p$ is a formula in $L$, so is $\neg p$. (2) $L$ 
contains a conjunction connective $\wedge$ such that if $p$ and $q$ are formulas in $L$, 
so is $p \wedge q$. (3) $L$ contains an implication connective $\rightarrow$ such that if $p$ and $q$ 
are formulas in $L$, so is $p \rightarrow q$. 

A consequence operation $Cn : 2^L \rightarrow 2^L$ represents a notion of entailment 
between sets of formulas from a language $L$. A set of formulas $\Gamma$ entails another 
set of formulas $\Gamma'$, written $\Gamma \vdash \Gamma'$, iff $Cn(\Gamma) \supseteq \Gamma'$. A set of formulas $\Gamma$ entails a 
formula $p$, written $\Gamma \vdash p$, iff $p \in Cn(\Gamma)$. I assume that $Cn$ satisfies the following 
properties, for all sets of formulas $\Gamma, \Gamma'$: Inclusion: $\Gamma \subseteq Cn(\Gamma)$; Monotonicity: 
$Cn(\Gamma) \subseteq Cn(\Gamma')$ whenever $\Gamma \subseteq \Gamma'$; and Iteration: $Cn(Cn(\Gamma)) = Cn(\Gamma)$. 

^ In this respect, Pareto-minimal revisions agree with Katsuno and Mendelzon’s ap- 
proach to “belief update” [6]; see Section 4. 

A theory is a deductively closed set of formulas. That is, a set of formulas 
$T \subseteq L$ is a theory iff $Cn(T) = T$. The entailment relation $\vdash$ is related to the 
propositional connectives as follows. 

Modus Ponens If $\Gamma \vdash p, (p \rightarrow q)$, then $\Gamma \vdash q$. 

Implication If $\Gamma \vdash q$, then $\Gamma \vdash (p \rightarrow q)$. 

Deduction $\Gamma \cup \{p\} \vdash q$ iff $\Gamma \vdash (p \rightarrow q)$. 

Conjunction $\Gamma \vdash (p \wedge q)$ iff both $\Gamma \vdash p$ and $\Gamma \vdash q$. 

Consistency Suppose that $\Gamma \nvdash p$. Then $\Gamma \cup \{\neg p\} \nvdash p$. 

Inconsistency $\{p \wedge \neg p\} \vdash L$. 

Double Negation $\Gamma \vdash p$ iff $\Gamma \vdash \neg\neg p$. 

Classical propositional logic satisfies these assumptions. Belief revision the- 
orists usually assume that the consequence relation Cn is compact; none of the 
results in this paper require compactness.^ For the remainder of this paper, as- 
sume that a language L and a consequence relation Cn (and hence an entailment 
relation h) have been fixed that satisfy the conditions laid down above. 

3 Theory Change: Additions and Retractions 

My approach to defining minimal belief change is to seek a partial order $\prec_T$, 
where we read $T_1 \prec_T T_2$ as “$T_1$ is a smaller change from $T$ than $T_2$ is”. Since 
this ordering is partial, there may be possible changes that are incomparable. As 
far as a given partial order among theory changes goes, if two changes are incom- 
parable, we should view neither as a smaller change than the other. However, a 
theory change $T_2$ from an old theory $T$ is not minimal if there is another, compa- 
rable, new theory $T_1$ such that $T_1 \prec_T T_2$. Thus I shall take minimal changes from 
a current theory $T$ to be the minimal elements in the given partial order $\prec_T$. 

I make use of decision-theoretic principles to define partial orders among 
theory changes. Let’s begin by distinguishing two kinds of change: A retraction 
in which the old theory entails a formula that the new theory does not entail, 
and an addition, in which the new theory entails a formula that the old theory 
does not entail. Thus $T'$ retracts the formula $p$ from $T$ iff $T \vdash p$ and $T' \nvdash p$, 
and $T'$ adds the formula $p$ to $T$ iff $T \nvdash p$ and $T' \vdash p$. 

^ A consequence relation $Cn$ is compact iff for all formulas $p$ and sets of formulas $\Gamma$, 
we have that $p \in Cn(\Gamma)$ only if $p \in Cn(\Gamma')$ for some finite subset $\Gamma'$ of $\Gamma$. 

Next, I define two partial orders among theory changes. The first partial 
order defines a notion of a new theory $T_1$ “retracting more” from a previous 
theory $T$ than another new theory $T_2$, namely if $T_1$ retracts all the formulas 
from $T$ that $T_2$ retracts from $T$, and $T_1$ retracts at least one formula from $T$ 
that $T_2$ does not retract. The second partial order defines a notion of a new 
theory $T_1$ “adding more” to a previous theory $T$ than another new theory $T_2$, 
namely if $T_1$ adds all the formulas to $T$ that $T_2$ adds to $T$, and $T_1$ adds at 
least one formula to $T$ that $T_2$ does not add to $T$. It is not difficult to see that 
these notions can be expressed in terms of set inclusions as follows ($\subset$ denotes 
proper set inclusion). 

Definition 1. Let $T, T_1, T_2$ be three theories. 

1. $T_1$ retracts more formulas from $T$ than $T_2$ does $\iff$ $T - T_2 \subset T - T_1$. 

2. $T_1$ adds more formulas to $T$ than $T_2$ does $\iff$ $T_2 - T \subset T_1 - T$. 

We may think of the addition partial order and the retraction partial or- 
der as defining two distinct dimensions of “cost” in theory revision. If additions 
and retractions were linked such that minimizing one minimizes the other, this 
distinction would have no interesting consequences for the question of how to 
minimize theory change; we would just minimize both additions and retractions 
at once. What makes the distinction important is the fact that in general, addi- 
tions and retractions trade off against each other. Typically, avoiding retractions 
entails adding more sentences than necessary, and avoiding additions entails re- 
tracting more sentences than necessary. An example will clarify this point. 

Example. Imagine a cognitive scientist who believes that a certain AI sys- 
tem, say SOAR, is the only candidate for machine intelligence. This scientist 
believes that “if SOAR is not intelligent ($\neg s$), there is no intelligent machine 
($\neg m$)”. Thus the scientist believes the sentence $p = \neg s \rightarrow \neg m$. Suppose that 
the scientist believes only the consequences of $p$, that is, her current theory is 
$T = Cn(\{p\})$. In particular, the scientist neither believes that there is an in- 
telligent machine ($m$), nor does she believe that there is no intelligent machine 
($\neg m$). Now the scientist receives new information to the effect that SOAR is not 
intelligent. She has to revise her theory $T$ on evidence $\neg s$. Let us consider two 
possible revisions, $T_1$ and $T_2$. Revision $T_1$ adds the new information $\neg s$ to $T$ and 
accepts the deductive consequences of this addition; thus $T_1 = Cn(\{p\} \cup \{\neg s\})$. 
This revision $T_1$ is logically stronger than $T$ and hence retracts nothing from $T$. 
However, the revision adds the sentence $\neg m$ (“there is no intelligent machine”), 
since $p$ and $\neg s$ entail $\neg m$. 

Contrast this with a different revision $T_2$ that retracts the scientist’s initial 
belief that SOAR is the only road to machine intelligence, and adds the new 
information that SOAR is not intelligent. That is, $T_2 = Cn(\{\neg s\})$. This revi- 
sion $T_2$ retracts more from $T$ than $T_1$ does. On the other hand, $T_2$ adds less 
to $T$ than $T_1$ does, since $T_2$ is strictly weaker than $T_1$. In particular, $T_2$ contin- 
ues to reserve judgment about whether machine intelligence is possible or not, 
whereas $T_1$ concludes that it is impossible ($\neg m$). 
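
The SOAR example can be checked mechanically in a small finite model. The sketch below is an illustration under stated assumptions, not part of the paper: it assumes a language with just two atoms (s for "SOAR is intelligent", m for "there is an intelligent machine"), identifies formulas up to logical equivalence with propositions (sets of truth assignments), and takes the consequences of a set of axioms to be every proposition that holds in all of its models. All function names are hypothetical. 

```python
from itertools import chain, combinations, product

# Worlds are truth assignments (s, m) to the two atoms.
WORLDS = frozenset(product([True, False], repeat=2))


def powerset(items):
    items = list(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


PROPOSITIONS = powerset(WORLDS)  # 16 propositions over 4 worlds


def prop(pred):
    """The proposition (set of worlds) where the predicate holds."""
    return frozenset(w for w in WORLDS if pred(*w))


def theory(*axioms):
    """Cn of the axioms: every proposition true in all models of the axioms."""
    models = WORLDS
    for axiom in axioms:
        models = models & axiom
    return frozenset(q for q in PROPOSITIONS if models <= q)


not_s = prop(lambda s, m: not s)
not_m = prop(lambda s, m: not m)
p = prop(lambda s, m: s or not m)  # (not s) -> (not m)

T = theory(p)           # current theory
T1 = theory(p, not_s)   # keep p, add the new information
T2 = theory(not_s)      # give up p, add the new information


def retracts_more(T, Ta, Tb):
    """Ta retracts more from T than Tb does (Definition 1, clause 1)."""
    return (T - Tb) < (T - Ta)


def adds_more(T, Ta, Tb):
    """Ta adds more to T than Tb does (Definition 1, clause 2)."""
    return (Tb - T) < (Ta - T)


print(T - T1 == frozenset())     # T1 retracts nothing
print(retracts_more(T, T2, T1))  # T2 retracts more than T1
print(adds_more(T, T1, T2))      # T1 adds more than T2
print(not_m in T1, not_m in T2)  # only T1 concludes "no intelligent machine"
```

In this model neither revision dominates the other: T2 gives up p, which T1 keeps, while T1 adds the conclusion that there is no intelligent machine, which T2 does not. 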

As the results below show, this example illustrates a general tension between 
avoiding additions and avoiding retractions; essentially, additions and retractions 
trade off against each other unless the current theory already entails the new 
information. When additions and retractions stand in conflict, how shall we make 
trade-offs between them? This is the topic of the next section. 
4 Pareto-Minimal Theory Change 

When a conflict arises between avoiding additions and avoiding retractions in
belief revision, an agent may strike a subjective balance between them, as in
any case of conflicting aims. She may assign one kind of change more subjective
weight than the other, or favour some beliefs as more “entrenched” than others.^3
But before we resort to subjective factors, we can look to decision theory
for an objective constraint that applies to all agents seeking to minimize theory
change. If avoiding changes is our aim, then we should avoid revisions that
make more additions than necessary without avoiding retractions, and we should
avoid revisions that make more retractions than necessary without avoiding additions.
This is an instance of the following uncontroversial principle for rational
choice under certainty between objects with multiple relevant attributes: If $A$
is at least as desirable as $B$ with respect to all relevant attributes, and $A$ is
strictly better than $B$ with respect to at least one attribute, choose $A$ over $B$.
The decision-theoretic term for this principle is Pareto-optimality.^4 For minimal
theory change, we can render it as follows.

Definition 2. Let $T, T_1, T_2$ be three theories. $T_1$ is a greater change from $T$
than $T_2$ is $\iff$

1. $T_1$ retracts more formulas from $T$ than $T_2$ does, and for all formulas $p$, if $T_2$
adds $p$ to $T$, then $T_1$ adds $p$ to $T$; or

2. $T_1$ adds more formulas to $T$ than $T_2$ does, and for all formulas $p$, if $T_2$
retracts $p$ from $T$, then $T_1$ retracts $p$ from $T$.

An equivalent purely set-theoretic definition is: $T_1$ is a greater change from $T$
than $T_2$ is iff $T_2 \,\Delta\, T \subset T_1 \,\Delta\, T$, where $\subset$ denotes proper inclusion and $\Delta$ is
symmetric difference ($A \,\Delta\, B = (A - B) \cup (B - A)$).^5 (I owe this definition to an
anonymous referee.)
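The equivalence of the two formulations can be checked exhaustively on a small language. The following sketch assumes a propositional language with two atoms, identifies each formula (up to logical equivalence) with the set of worlds where it is true, and reads “retracts more” and “adds more” as proper inclusion of the retracted and added sets; the function names are hypothetical.

```python
from itertools import combinations

WORLDS = range(4)  # the four valuations of a two-atom language
# Up to logical equivalence, a formula is the set of worlds in which it is true.
PROPS = [frozenset(c) for r in range(len(WORLDS) + 1) for c in combinations(WORLDS, r)]

def theory(model_set):
    """The deductively closed theory with these models: all propositions true in each of them."""
    return frozenset(q for q in PROPS if model_set <= q)

def greater_change_def2(T, T1, T2):
    """Definition 2, clause by clause (proper inclusion for 'more')."""
    add1, add2 = T1 - T, T2 - T
    ret1, ret2 = T - T1, T - T2
    clause1 = ret2 < ret1 and add2 <= add1
    clause2 = add2 < add1 and ret2 <= ret1
    return clause1 or clause2

def greater_change_symdiff(T, T1, T2):
    """Set-theoretic version: T2 Δ T is a proper subset of T1 Δ T."""
    return (T2 ^ T) < (T1 ^ T)

THEORIES = [theory(m) for m in PROPS]  # includes the inconsistent theory (no models)
agree = all(
    greater_change_def2(T, A, B) == greater_change_symdiff(T, A, B)
    for T in THEORIES for A in THEORIES for B in THEORIES
)
print("definitions agree on all triples:", agree)
```

The check works because additions lie outside $T$ and retractions inside $T$, so the symmetric difference splits cleanly into the two kinds of change.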

Thus the principle of Pareto-optimality defines a partial relation $\prec_T$ between
theories: $T_2 \prec_T T_1$ iff $T_1$ is a greater change from $T$ than $T_2$ is. It seems
that we can now take a minimal change from $T$ to be a minimal theory in the
$\prec_T$-ordering. But on that definition, the only minimal change from $T$ is $T$ itself!
Of course, it is generally true that the smallest change is no change, on
any acceptable notion of “small change”. What we want is a minimal change
that satisfies additional constraints. In the case of belief update, the additional
constraint is that the minimal theory change should incorporate the new information.
Accordingly, I define a Pareto-minimal theory change from $T$, given
new information $p$, as a theory that is minimal in the $\prec_T$-ordering among the
theories that entail $p$.

^3 Many investigators assume that a relation of “epistemic entrenchment” guides belief
revision (e.g., Gärdenfors and Nayak [3, Ch.4], [8]). They typically take epistemic
entrenchment to be subjective in the sense that different rational agents may view
the same belief as entrenched to different degrees.

^4 Social choice theorists often use Pareto-optimality as a principle for comparing social
states. The Pareto principle applies both to social choice and to choice between
objects with multiple attributes because these two choice situations are formally
equivalent (identify the set of “attributes” with the set of individual members of
society).

^5 Chou and Winslett too define a partial order among (first-order) models of the
form “$N$ is closer to $M$ than $N'$ is” in terms of symmetric difference [2]. From the
perspective of this paper, their definition is a special case of Definition 2, namely
Pareto-minimality applied to models rather than theories.

Definition 3. Let $T, T_1$ be two theories, and let $p$ be a formula. Then $T_1$ is a
Pareto-minimal change from $T$ that incorporates $p$ $\iff$

1. $T_1 \vdash p$, and

2. there is no other theory $T_2$ such that $T_2 \vdash p$ and $T_1$ is a greater change
from $T$ than $T_2$ is.

Now we are ready for the main result of this paper: necessary and sufficient
conditions for a theory revision to be a Pareto-minimal change.

Theorem 1. Let $T$ be a theory and let $p$ be a formula. A theory revision $T * p$
is a Pareto-minimal change from $T$ that incorporates $p$ $\iff$

1. $T * p \vdash p$, and

2. $T \cup \{p\} \vdash T * p$, and

3. if $T \vdash p$, then $T * p = T$.

The theorem shows that the tension between additions and retractions arises
whenever the agent’s current theory does not already entail the new information.
When this is the case, the revisions that make Pareto-acceptable trade-offs
run in strength from adding the evidence to the current theory ($T \cup \{p\}$) to
entailing nothing but the evidence and its consequences ($\{p\}$). This account of
minimal change distinguishes sharply between the case in which the current theory
already entails the new information and the case in which it does not. The
standard AGM axioms [3, Ch.3.3] also make a sharp distinction, but along a
different line: they distinguish between the case in which the evidence is consistent
with the current theory (but not necessarily already part of it) and the
case in which the evidence is inconsistent with the current theory. Specifically,
the AGM axiom K*3 requires that $T \cup \{p\} \vdash T * p$, which is the characteristic
axiom of Pareto-minimal theory change. The postulate K*4 posits that if $T \cup \{p\}$
is consistent, then $T * p \vdash T \cup \{p\}$. Thus the AGM axioms require the revised
theory to be $Cn(T \cup \{p\})$ whenever $p$ is consistent with $T$. In that case, the
revision $Cn(T \cup \{p\})$ is a Pareto-minimal theory change, but it is just one of
many possible Pareto-minimal revisions, namely the logically strongest one.
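Theorem 1 can be sanity-checked by brute force on a small language. The sketch below is an illustration under stated assumptions, not the paper's proof: it assumes classical propositional logic over two atoms, represents a theory by its set of models, and translates the three conditions semantically (condition 2 becomes $M(T) \cap M(p) \subseteq M(T * p)$). All function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

WORLDS = frozenset(range(4))  # valuations of a two-atom language
PROPS = [frozenset(c) for r in range(len(WORLDS) + 1)
         for c in combinations(sorted(WORLDS), r)]

@lru_cache(maxsize=None)
def theory(model_set):
    """Closed theory with the given models, as a set of propositions."""
    return frozenset(q for q in PROPS if model_set <= q)

def greater_change(T, T1, T2):
    """T1 is a greater change from T than T2 (symmetric-difference form)."""
    return (T2 ^ T) < (T1 ^ T)

def pareto_minimal(M, M1, P):
    """Definition 3 on model sets: M1 incorporates P and no rival that incorporates P changes less."""
    if not M1 <= P:
        return False
    T, T1 = theory(M), theory(M1)
    return not any(greater_change(T, T1, theory(M2)) for M2 in PROPS if M2 <= P)

def theorem1_conditions(M, M1, P):
    """Conditions 1-3 of Theorem 1, read semantically."""
    cond1 = M1 <= P                           # T*p entails p
    cond2 = (M & P) <= M1                     # T ∪ {p} entails T*p
    cond3 = (M1 == M) if M <= P else True     # if T entails p, then T*p = T
    return cond1 and cond2 and cond3

mismatches = [
    (M, M1, P)
    for M in PROPS for M1 in PROPS for P in PROPS
    if pareto_minimal(M, M1, P) != theorem1_conditions(M, M1, P)
]
print("counterexamples found:", len(mismatches))
```

An empty list of mismatches is what the theorem predicts for this toy setting; it says nothing about consequence relations that fail the structural properties used in the proof.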

Another theory of belief change that endorses K*3 but not K*4 is the “updating”
approach [6]. Intuitively, the connection between Pareto-minimality and
the Update operator is this: Katsuno and Mendelzon postulate that “an update
method should give each of the old possible worlds [in which the previous theory
is true] equal consideration” [5, p.4]. Translating from possible worlds to sets of
sentences, this means that Update treats adding new beliefs (removing possible
worlds) as a “cost” in belief change, which can justify retracting previous beliefs
(adding new possible worlds), even when the new information is consistent with
the agent’s current theory (for an example, see [5, p.7]).

Katsuno and Mendelzon argue that giving equal consideration to each of 
the old possible worlds is appropriate when an agent learns how the world has 
changed (update) rather than new facts about a static world (revision). This 
suggests that an agent’s attitude towards the relative importance of additions 
and retractions may depend on the context and content of her beliefs. Pareto-minimality
weights additions and retractions equally; in other contexts we may
wish to give priority to minimizing retractions.^6 In the limiting case, we give
absolute priority to minimizing retractions first, and only then consider avoiding 
additions. It can be shown that an agent’s theory revision satisfies K*4 if and 
only if the agent makes the trade-off between additions and retractions in this 
way. 



5 Pareto-Minimal Revision of Belief Bases

So far I have treated all of an agent’s beliefs as equally important. A more refined 
representation of the agent’s epistemic state may distinguish between a “basic”
set of beliefs $B$, and the consequences of $B$ that the agent might be said to hold
because she believes $B$.^7 Hansson endorses the distinction between a basic set of
beliefs and their consequences as a “small step toward capturing the justificatory 
structure” of an agent’s beliefs [4]. I shall take a base for a theory T to be a set 
of formulas B, which may or may not be deductively closed, such that B T. 
(For more on belief bases, see [9,10] and the references therein). 

To define Pareto-minimal revision of belief bases, I begin again with two ways
of making a change to a belief base. If $B, B'$ are two bases, I say that $B'$ retracts
the formula $p$ from $B$ iff $p \in B$ and $p \notin B'$, and that $B'$ adds the formula $p$ to $B$
iff $p \notin B$ and $p \in B'$. The definition of “adding more” and “retracting more”
from a base is just like that for theories (cf. Definition 1). Thus $B_1$ retracts
more formulas from $B$ than $B_2$ iff $B - B_2 \subset B - B_1$, and $B_1$ adds more
formulas to $B$ than $B_2$ iff $B_2 - B \subset B_1 - B$.

As with Definition 3, we can apply the principle of Pareto-optimality to define 
a partial comparison of base revisions with respect to the extent of change that 
they induce. 

Definition 4. Let $B, B_1, B_2$ be three bases. Then $B_1$ is a greater change
from $B$ than $B_2$ is $\iff$

1. $B_1$ retracts more formulas from $B$ than $B_2$ does, and for all formulas $p$,
if $B_2$ adds $p$ to $B$, then $B_1$ adds $p$ to $B$; or

2. $B_1$ adds more formulas to $B$ than $B_2$ does, and for all formulas $p$, if $B_2$
retracts $p$ from $B$, then $B_1$ retracts $p$ from $B$.

^6 Levi presents a theory of how an agent may minimize the loss of “damped informational
value” [7, Ch.2.1]. In my terms, this is advice for how to retract some beliefs
to avoid adding too many.

^7 A paradigm example is a database, where we may distinguish between the records
that are explicitly stored in the database and what follows from the explicitly stored
information.

As with Definition 2, an equivalent set-theoretic definition is that $B_1$ is a
greater change from $B$ than $B_2$ is iff $B_2 \,\Delta\, B \subset B_1 \,\Delta\, B$.

When we consider the extent of change of a belief base, it is natural to take
into account only changes in basic beliefs, not changes in the logical consequences
of the basic beliefs that “just follow” from them. My definition of retracting and
adding to a belief base expresses this view of minimal belief change by considering
only which sentences are added to or retracted from the set of basic beliefs.
For example, there may be a sentence $q$ such that $B * p \vdash q$ and $B \nvdash q$ but
$q \notin B * p$. In that case the revision $B * p$ adds $q$ to the logical consequences of the
agent’s beliefs, but does not add $q$ to her basic beliefs. In effect, Definition 4 does
not count such additions to the consequences of the agent’s basic beliefs as an
addition, unless they are also additions to the agent’s basic beliefs themselves.
Discounting changes in the logical consequences of basic beliefs in this way gives
rise to a fundamental difference between the Pareto-minimal revision of basic
beliefs and Pareto-minimal theory change: Pareto-minimal base revisions never
add basic beliefs to the previous ones other than the new information. For suppose
that a revision $B * p$ adds a belief $q$ other than $p$ to a base $B$; then $(B * p) - \{q\}$
adds less to $B$ and retracts no more. Hence $B * p$ is not a Pareto-minimal change of $B$.
In contrast, a theory revision $T * p$ will typically add many beliefs to $T$, namely
logical consequences of previous beliefs conjoined with the new information $p$.
Another way to put the point is that for bases a conflict between additions and
retractions does not arise: it is possible to minimize both additions and retractions
at the same time. In the case in which the new information contradicts the
current basic beliefs, this will lead an agent to hold inconsistent beliefs. Since
many researchers accept as a general norm of epistemic rationality that an agent
ought to avoid inconsistent beliefs, I shall restrict Pareto-minimal revisions to
consistent bases.
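The argument that an extra added belief can always be dropped can be replayed on a toy base. In this sketch formulas are plain strings compared syntactically, which matches the membership-based reading of Definition 4; the base, the new information and the extra belief are hypothetical, and no consistency check is performed.

```python
def added(base, revised):
    """Formulas that the revised base adds (syntactic membership only)."""
    return revised - base

def retracted(base, revised):
    """Formulas that the revised base retracts (syntactic membership only)."""
    return base - revised

def greater_change(base, b1, b2):
    """b1 is a greater change from base than b2 iff b2 Δ base is a proper subset of b1 Δ base."""
    return (b2 ^ base) < (b1 ^ base)

B = frozenset({"a", "a -> b"})       # hypothetical base
p, q = "~b", "c"                     # new information p and an unrelated extra belief q
revision = frozenset({"a", p, q})    # a revision that also adds q
trimmed = revision - {q}             # the same revision without q

print(sorted(added(B, revision)), sorted(retracted(B, revision)))
print(greater_change(B, revision, trimmed))
```

Dropping `q` shrinks the set of additions and leaves the retractions unchanged, so the untrimmed revision cannot be Pareto-minimal.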

Definition 5. Let $B, B_1$ be two bases, and let $p$ be a formula. Then $B_1$ is a
Pareto-minimal consistent change from $B$ that incorporates $p$ $\iff$

1. $p \in B_1$, and

2. $B_1$ is consistent, and

3. there is no other consistent base $B_2$ such that $p \in B_2$ and $B_1$ is a greater
change from $B$ than $B_2$ is.

What are the characteristic properties of Pareto-minimal base revisions? It
turns out that a version of a proposal originally due to Levi amounts to necessary
and sufficient conditions for a base revision to be Pareto-minimal and consistent.
The proposal is to think of a Pareto-minimal revision of a belief base $B$ on new
information $p$ as proceeding in two steps: First, remove just enough beliefs from $B$
to obtain a belief base $B'$ that is consistent with $p$; then add $p$ to $B'$. Formally,
we require that $B'$ be a belief base that is consistent with $p$ (thus $B' \nvdash \neg p$) and
removes as few beliefs from $B$ as possible. Hence I define a retraction-minimal
contraction of a belief base as follows.

Definition 6. Let $B, B_1$ be two bases, and let $p$ be a formula. Then $B_1$ is a
retraction-minimal contraction from $B$ on $p$ $\iff$

1. $B_1 \subseteq B$, and

2. $B_1 \nvdash p$, and

3. there is no other base $B_2$ such that $B_2 \nvdash p$ and $B_1$ retracts more from $B$
than $B_2$ does.

Retraction-minimal contractions of a base $B$ on new information $p$ have a
simple characterization: they are exactly those subsets of $B$ that cannot be
expanded without entailing $p$. (The proof is left to the reader.)

Lemma 1. Let $B, B_1$ be two bases such that $B_1 \subseteq B$, and let $p$ be a formula.
Then $B_1$ is a retraction-minimal contraction from $B$ on $p$ $\iff$ for all formulas
$q$, if $B_1$ retracts $q$ from $B$, it is the case that $B_1 \cup \{q\} \vdash p$.

Thus retraction-minimal contractions are those that belief revision theorists
refer to as “maxichoice contractions” [3, Ch.4.2]. The Levi identity says that
minimal revisions of a belief set $K$ given new information $p$ are the result of
adding $p$ after contracting $K$ on $\neg p$ (see [3, Ch.3.6]). The next proposition shows
that the Levi identity for retraction-minimal (maxichoice) contractions characterizes
Pareto-minimal revisions of belief bases that lead to consistent belief
bases.

Theorem 2 (The Levi Identity for Belief Bases). Let $B$ be a base and
let $p$ be a formula. Suppose that a revision $B * p$ contains $p$. Then $B * p$ is
a Pareto-minimal consistent change from $B$ that incorporates $p$ $\iff$ there is a
retraction-minimal contraction $B'$ from $B$ on $\neg p$ such that $B * p = B' \cup \{p\}$.
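The two-step construction of Theorem 2 is easy to prototype for small bases. The sketch below is a minimal illustration, assuming classical propositional logic over two atoms with entailment decided by truth tables; the formula table `FORMULAS`, the string labels and the function names are hypothetical. Retraction-minimal contractions are computed as the maximal subsets that do not entail the given formula, in the spirit of Lemma 1.

```python
from itertools import combinations, product

ATOMS = ("a", "b")

def worlds():
    """All truth assignments to the atoms."""
    for bits in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, bits))

# Hypothetical toy formulas: label -> truth function.
FORMULAS = {
    "a": lambda w: w["a"],
    "a -> b": lambda w: (not w["a"]) or w["b"],
    "b": lambda w: w["b"],
    "~b": lambda w: not w["b"],
}

def entails(base, label):
    """base entails label iff label holds in every world satisfying all of base."""
    return all(FORMULAS[label](w) for w in worlds()
               if all(FORMULAS[f](w) for f in base))

def retraction_minimal_contractions(base, label):
    """Maximal subsets of base that do not entail label."""
    base = frozenset(base)
    safe = [frozenset(c) for r in range(len(base) + 1)
            for c in combinations(sorted(base), r) if not entails(c, label)]
    return [s for s in safe if not any(s < t for t in safe)]

def levi_revisions(base, p, not_p):
    """Contract the base on the negation of p, then add p."""
    return [c | {p} for c in retraction_minimal_contractions(base, not_p)]

B = {"a", "a -> b"}
revisions = levi_revisions(B, "~b", "b")   # revise by p = ~b, whose negation is b
for r in revisions:
    print(sorted(r))
```

For this base there are two candidate revisions, one that keeps `a` and one that keeps `a -> b`; the theorem leaves the choice between them open.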

I omit the proof for space reasons. In view of Theorem 2, it is not difficult
to see that Pareto-minimal consistent revisions of belief bases satisfy the AGM
axioms K*1-K*5 (interpreted for base revisions with $\supseteq$ in place of $\vdash$; see also
[1, Part II]).^8 The converse is not true, however: Pareto-minimality places more
constraints on the revision of belief bases than K*1-K*5, since AGM revisions
need not be the result of maxichoice contractions and hence may give up more
beliefs than Pareto-minimal revisions.

^8 For K*2 I require that $p \in B * p$. For K*5 we must assume that the underlying
consequence relation $\vdash$ is consistent in the sense that $\emptyset \nvdash \bot$; otherwise there is no
consistent base. When the new information $p$ is inconsistent, there is no consistent
revision on $p$; in that case I require that $B * p$ is an inconsistent base in accordance
with K*5.

Alchourrón and Makinson conjectured that “when applied to bases that are
irredundant, choice contraction and revision functions serve as good formal
representations of the corresponding intuitive processes” [1, p.21]. Theorem 2
establishes a formal version of this conjecture, in which Pareto-minimality takes
the place of “intuition”.^9



6 Conclusion 

The principle of minimal belief change is an important and influential idea in sev- 
eral areas of computer science such as artificial intelligence and database theory. 
This paper showed a method for rigorously deriving axioms for minimal belief 
change from fundamental decision-theoretic principles. This approach clarifies 
the foundations of belief revision postulates, and it allows us to distinguish universally
valid postulates from those whose applicability depends to a larger extent
on the details of how we represent beliefs and the relative weight we assign to 
retractions and additions in a given application domain. 

Specifically, with regard to beliefs represented by deductively closed theories,
Theorem 1 shows that the AGM axiom K*3 is universally valid for Pareto-minimal
belief change, whereas the axiom K*4 is not.

With regard to beliefs represented by belief bases — which need not be de- 
ductively closed — Pareto-minimality validates K*4 and other staples of belief 
revision theory such as the Levi identity. In fact, Pareto-minimal base revision 
obeys constraints that go beyond the AGM postulates. Thus the results of my 
analysis of base revision largely agree with previous work; however, my method 
is different: I do not appeal to intuition, or even representation theorems, for
justifying belief revision maxims, but instead derive them from fundamental 
decision-theoretic principles. 

Altogether, the results in this paper show that Pareto-minimality provides a 
fruitful and principled decision-theoretic foundation for postulates guiding min- 
imal belief revision. 



7 Proof of Theorem 1 

Theorem 1. Let $T$ be a theory and let $p$ be a formula. A theory revision $T * p$
is a Pareto-minimal change from $T$ that incorporates $p$ $\iff$

1. $T * p \vdash p$, and

2. $T \cup \{p\} \vdash T * p$, and

3. if $T \vdash p$, then $T * p = T$.

Proof. ($\Rightarrow$) Part 1: Immediate from Definition 3. Part 2: I show the contrapositive.
Suppose that $T \cup \{p\} \nvdash T * p$. Then there is a formula $q$ in $T * p$ such that
$T \cup \{p\} \nvdash q$. So (a) $T \nvdash q$ by Monotonicity. Now consider
$T' = (T * p) \cap Cn(T \cup \{p\} \cup \{\neg q\})$. First I note that $T'$ is closed under deductive consequence. For let
$r \in Cn((T * p) \cap Cn(T \cup \{p\} \cup \{\neg q\}))$. Then by Monotonicity, $r \in Cn(T * p)$ and
$r \in Cn(Cn(T \cup \{p\} \cup \{\neg q\}))$. We assumed that $T * p$ is closed under consequence,
and Iteration implies that $Cn(Cn(T \cup \{p\} \cup \{\neg q\})) = Cn(T \cup \{p\} \cup \{\neg q\})$; thus
$r \in (T * p) \cap Cn(T \cup \{p\} \cup \{\neg q\})$. This shows that $Cn(T') = T'$.

^9 Nebel also argues for constructing minimal base revisions from maxichoice contractions
followed by adding the new information [9, Secs. 7, 8].

Next, note that (b) $T' \nvdash q$ because $Cn(T \cup \{p\} \cup \{\neg q\}) \nvdash q$ by Consistency
(applied to $T \cup \{p\}$) and Iteration; thus from Monotonicity and the fact that
$T' \subseteq Cn(T \cup \{p\} \cup \{\neg q\})$, it follows that $T' \nvdash q$. Moreover, we have from Monotonicity
and the fact that $T' \subseteq T * p$ as well that (c) if $T'$ adds a formula to $T$, so does
$T * p$. From (a), (b) and (c) it follows that (d) $T * p$ adds more formulas to $T$
than $T'$.

Now I show that (e) $T'$ retracts from $T$ exactly the formulas that $T * p$ retracts
from $T$. Monotonicity implies immediately that if $T * p$ retracts a formula from $T$,
so does $T'$. For the converse, suppose that $T'$ retracts a formula $r$ from $T$. Since
$Cn(T \cup \{p\} \cup \{\neg q\}) \vdash T$, this implies that $r \notin (T * p)$. And that means that $T * p$
retracts $r$ from $T$ as well.

Finally, we have that (f) $T' \vdash p$, since $T * p \vdash p$ by Part 1 and clearly
$Cn(T \cup \{p\} \cup \{\neg q\}) \vdash p$. Together, (a)-(f) establish that $T'$ incorporates $p$ and
$T * p$ is a greater change from $T$ than $T'$ is. Hence $T * p$ is not a Pareto-minimal
change.

Part 3: Immediate, since every theory other than $T$ retracts or adds more
formulas to $T$ than $T$ itself does.

($\Leftarrow$) Suppose that $T * p$ satisfies conditions 1, 2 and 3. Then the claim is
immediate if $T \vdash p$ and $T * p = T$; suppose that $T \nvdash p$. I show that $T * p$ is not
a greater change from $T$ than any other change $T'$ that incorporates $p$.

First, suppose that $T * p$ retracts a formula $q$ from $T$ but $T'$ does not, such
that $T' \vdash q$. Then $T' \vdash (p \wedge q)$ by Conjunction, whereas $T * p \nvdash (p \wedge q)$ by
Conjunction as well. Since we supposed that $T \nvdash p$, it follows that $T \nvdash (p \wedge q)$ by
Conjunction once more. So $T'$ adds a formula to $T$, namely $p \wedge q$, that $T * p$
does not add to $T$, and hence $T * p$ is not a greater change from $T$ than $T'$ is.

Second, suppose that $T * p$ adds a formula $q$ to $T$, but $T' \nvdash q$. Condition 2
asserts that $T \cup \{p\} \vdash T * p$ and hence $Cn(T \cup \{p\}) \vdash q$. By Deduction, we have
that (a) $T \vdash p \rightarrow q$. Moreover, Implication implies that (b) $T * p \vdash p \rightarrow q$. Also,
(c) $T' \nvdash p \rightarrow q$. For suppose that on the contrary, $T' \vdash p \rightarrow q$. Then since $T' \vdash p$,
it follows from Modus Ponens that $T' \vdash q$, contrary to assumption. From (a), (b)
and (c) we have that $T'$ retracts a formula from $T$, namely $p \rightarrow q$, that $T * p$
does not retract from $T$. Thus $T * p$ is not a greater change from $T$ than $T'$ is.

These arguments establish that if $T * p$ satisfies conditions 2 and 3, then
there is no theory $T'$ incorporating $p$ such that $T * p$ is a greater change from $T$
than $T'$ is. From Condition 1 it follows that $T * p$ is a Pareto-minimal change
from $T$ that incorporates $p$. □

Abstract. Generalisations of theory change involving arbitrary sets of
wffs instead of belief sets have become known as base change. In one 
view, a base should be thought of as providing more structure to its 
generated belief set, and can be used to determine the theory change operation
associated with a base change operation. In this paper we extend
a proposal along these lines by Meyer et al. [12]. We take an infobase as
a finite sequence of wffs, with each element in the sequence being seen
as an independently obtained bit of information, and define appropriate
infobase change operations. The associated theory change operations satisfy
the AGM postulates for theory change [1]. Since an infobase change
operation produces a new infobase, it allows for iterated infobase change. 
We measure iterated infobase change against the postulates proposed by 
Darwiche et al. [2,3] and Lehmann [10]. 



1 Introduction 

It is generally accepted that belief sets do not have a rich enough structure to 
serve as appropriate models for epistemic states [8], [6], and theory change is 
therefore regarded as an elegant idealisation of a more general theory of belief 
change, involving arbitrary sets of wffs known as bases. In one view, a base should 
be thought of as providing more structure to its associated belief set. The added 
structure of the base can be used, in one way or another, to pick an appropriate 
associated theory contraction operation. This, in turn, can be used to aid in the 
process of constructing a range of suitable base contraction operations. Recently, 
Meyer et al. [12] proposed a form of base change along these lines. They regard 
an infobase as a finite set of wffs consisting of independently obtained bits of 
information. Taking AGM theory change [1] as the general framework in which 
to operate, they present a method that uses the structure of an infobase to 
determine which AGM theory change operation to associate with the infobase 
change operation to be constructed. In this paper we improve on the proposal 
by Meyer et al. [12] in two ways. Firstly, and in line with the claim by Meyer 
et al. [12] that the definition of an infobase as a finite set of wffs is in conflict 
with the intuition of independently obtained wffs,^1 we view an infobase as a
finite sequence of wffs. This has a number of favourable consequences. Secondly,
the approach of Meyer et al. [12] associates a unique infobase contraction and
revision operation with every infobase. We generalise this approach by allowing
for a whole spectrum of infobase contraction and revision operations obtained
from a given infobase, ranging from a “foundational” approach at one extreme
to a “coherentist” approach at the other.

^1 See [11,12] for a justification of this claim.

N. Foo (Ed.): AI’99, LNAI 1747, pp. 156-167, 1999.

2 Preliminaries 

For the rest of this paper $L$ denotes a finitely generated propositional language,
closed under the usual propositional connectives, and with a classical model-theoretic
semantics.^2 The set of interpretations of $L$ is denoted by $U$. For every
$X \subseteq L$, we denote the set of models of $X$ by $M(X)$, and for $\alpha \in L$ we write
$M(\alpha)$ instead of $M(\{\alpha\})$. Classical entailment is denoted by $\vDash$. Closure under
entailment is denoted by $Cn$. A theory or a belief set is a set $K \subseteq L$ closed
under entailment. For every $V \subseteq U$, we let $Th(V)$ denote the theory determined
by $V$. For every $V \subseteq U$, we let $F_V$ denote some wff for which $M(F_V) = V$. It is
well-known that such a finite axiomatisation exists for every $V \subseteq U$.

An infobase will be represented as a finite sequence of wffs enclosed by square
brackets. Although infobases are sensitive to the order in which wffs occur, as well
as to their syntactical form, we shall see that these superficial qualities can be
done away with by employing the notion of element-equivalence. Two infobases
$IB$ and $IC$ are element-equivalent, written as $IB \approx IC$, iff for every $\beta$ occurring
in $IB$ such that $\nvDash \beta$, there is a unique logically equivalent wff $\gamma$ occurring in
$IC$, and for every $\gamma$ occurring in $IC$ such that $\nvDash \gamma$, there is a unique logically
equivalent wff $\beta$ occurring in $IB$. We shall sometimes abuse notation slightly
by applying the notion of element-equivalence to sets instead of infobases. For
a finite sequence $\sigma$ of wffs, we use the symbol $\bullet$ to denote concatenation by a
single wff. The converse of concatenation (removing the last wff from a finite
sequence $\sigma$) will be denoted by V. For a finite sequence $\sigma$ of wffs, the set of wffs
occurring in $\sigma$ is denoted by $S(\sigma)$. That is, $S(\sigma) = \{\beta \mid \beta \text{ occurs in } \sigma\}$. We say
that an infobase $IB$ is associated with a belief set $K$ (and $K$ is associated with
$IB$) iff $Cn(S(IB)) = K$.

Formally, we consider infobase change operations (which include contraction
and revision operations) as functions from $\mathcal{IB} \times L$ to $\mathcal{IB}$, where $\mathcal{IB}$ is the set of
all infobases. We shall also frequently assume the existence of a fixed infobase
$IB$, and consider infobase $IB$-change operations as functions from $L$ to $\mathcal{IB}$.

From results by Grove [7] and Katsuno et al. [9], AGM theory change can
be characterised by a set of total preorders (i.e. connected, reflexive, transitive
relations) on $U$. For a total preorder $\preceq$ on $U$, we say that $x \in V \subseteq U$ is
$\preceq$-minimal in $V$ iff for every $y \in V$, $x \preceq y$, and we denote the set of $\preceq$-minimal
elements of $M(\phi)$ by $Min_{\preceq}(\phi)$. For $X \subseteq L$, $\preceq$ is $X$-faithful iff $x \prec y$ for every
$x \in M(X)$ and $y \notin M(X)$, and $x \preceq y$ for every $x, y \in M(X)$. The required
results are obtained in terms of the following two identities:

(Def $-$ from $\preceq$) $K - \alpha = Th(M(K) \cup Min_{\preceq}(\neg\alpha))$

(Def $*$ from $\preceq$) $K * \alpha = Th(Min_{\preceq}(\alpha))$

^2 Meyer et al. [11] provides a treatment of basic infobase change involving a larger
class of logics.

Theorem 1. 1. Every $K$-faithful total preorder defines an AGM theory contraction
using (Def $-$ from $\preceq$). Conversely, every AGM theory contraction
can be defined in terms of a $K$-faithful total preorder using (Def $-$ from $\preceq$).
2. Every $K$-faithful total preorder defines an AGM theory revision using (Def
$*$ from $\preceq$). Conversely, every AGM theory revision can be defined in terms
of a $K$-faithful total preorder using (Def $*$ from $\preceq$).

The following two identities can be used to define AGM theory revision and
theory contraction in terms of one another.

(Harper Identity) $K - \phi = K \cap (K * \neg\phi)$

(Levi Identity) $K * \phi = (K - \neg\phi) + \phi$
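The semantic definitions and the two identities can be illustrated on a two-atom language. The sketch below assumes that a belief set is represented by its set of models, that a total preorder is given by a numeric rank (lower is better), and that formulas are identified with their sets of models; the rank table and function names are hypothetical. On models, intersecting theories corresponds to taking the union of model sets, and expansion by $\phi$ corresponds to intersecting with $M(\phi)$.

```python
from itertools import combinations, product

WORLDS = frozenset(product([0, 1], repeat=2))   # valuations of two atoms (a, b)
PROPS = [frozenset(c) for r in range(len(WORLDS) + 1)
         for c in combinations(sorted(WORLDS), r)]

def minimal(rank, worlds):
    """The preorder-minimal elements of a set of worlds (lowest rank)."""
    if not worlds:
        return frozenset()
    best = min(rank[w] for w in worlds)
    return frozenset(w for w in worlds if rank[w] == best)

def contract(rank, K_models, phi):
    """Models of K - phi: M(K) ∪ Min(¬phi)."""
    return K_models | minimal(rank, WORLDS - phi)

def revise(rank, phi):
    """Models of K * phi: Min(phi)."""
    return minimal(rank, phi)

# Hypothetical example: K = Cn({a ∧ b}); the rank is K-faithful because
# exactly the models of K receive the lowest rank.
K = frozenset({(1, 1)})
RANK = {(1, 1): 0, (1, 0): 1, (0, 1): 1, (0, 0): 2}

# Harper: M(K ∩ (K * ¬phi)) = M(K) ∪ M(K * ¬phi).
harper_ok = all(contract(RANK, K, phi) == K | revise(RANK, WORLDS - phi) for phi in PROPS)
# Levi: M((K - ¬phi) + phi) = M(K - ¬phi) ∩ M(phi).
levi_ok = all(revise(RANK, phi) == contract(RANK, K, WORLDS - phi) & phi for phi in PROPS)
print(harper_ok, levi_ok)
```

The Harper check holds by construction in this representation; the Levi check depends on the rank being faithful to $K$.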

With the exception of Theorem 1, which is a well-known result by now, the proofs
of the results in this paper can all be found in [11]. 

3 Infobase Change 

To construct an infobase contraction, we first use the structure of the infobase
$IB$ to obtain an $S(IB)$-faithful total preorder. The theory contraction obtained
from the $S(IB)$-faithful total preorder is taken to be the theory contraction
associated with the infobase contraction that we aim to construct.

Definition 1. For every infobase $IB$, a theory contraction $-$ is associated with
an infobase $IB$-contraction $\ominus$ iff $Cn(S(IB)) - \alpha = Cn(S(IB \ominus \alpha))$ for every
$\alpha \in L$.

Using the intuition associated with an infobase, we order the interpretations in U 
according to the number of wffs of IB they satisfy; the more they satisfy, the 
“better” they are deemed to be, and the lower down in the ordering they will 
be. 

Definition 2. For $u \in U$, we define $u_{IB}$, the $IB$-number of $u$, as the number
of wffs $\beta$ in $IB$ such that $\nvDash \beta$ and $u \in M(\beta)$.

This ordering is used to obtain an appropriate $S(IB)$-faithful total preorder in
terms of $IB$ as follows:

(Def $\preceq$ from $IB$) $u \preceq v$ iff $v_{IB} \leq u_{IB}$

Definition 3. We refer to the faithful total preorder $\preceq_{IB}$ defined in terms of an
infobase $IB$ using (Def $\preceq$ from $IB$) as the $IB$-induced faithful total preorder.

The $IB$-induced faithful total preorder is used to construct a theory contraction
as follows:

(Def $-_{IB}$ from $IB$) $Cn(S(IB)) -_{IB} \alpha = Th(M(S(IB)) \cup Min_{\preceq_{IB}}(\neg\alpha))$

Definition 4. The theory contraction $-_{IB}$ defined in terms of an infobase $IB$
using (Def $-_{IB}$ from $IB$) is referred to as the $IB$-induced theory contraction.

Clearly the $IB$-induced theory contraction is an AGM theory contraction.
Associating the $IB$-induced theory contraction with the infobase $IB$-contraction
allows us to determine which wffs in $IB$ should be retained, and which cannot
be retained, after a contraction of $IB$.

Definition 5. The set of $\alpha$-discarded wffs (of an infobase $IB$) is defined as
$IB^{-\alpha} = \{\beta \in S(IB) \mid \beta \notin Cn(S(IB)) -_{IB} \alpha\}$. We refer to $S(IB) \setminus IB^{-\alpha}$ as
the set of $\alpha$-retained wffs (of $IB$).

The $\alpha$-retained wffs are precisely the wffs in $IB$ that should be retained when
contracting $IB$ by $\alpha$, while the $\alpha$-discarded wffs are replaced with appropriately
weakened wffs. In deciding on an appropriate method for the weakening of the
$\alpha$-discarded wffs, it is necessary to strike the right balance between what we
tentatively refer to as a coherentist approach, emphasising knowledge level matters,
and a foundationalist approach, emphasising the independence of the wffs
occurring in $IB$. The following example serves to make these matters concrete.

Example 1. Consider the infobase $IB = [p, q, r]$. Figure 1 gives a graphical representation
of the $IB$-induced faithful total preorder $\preceq_{IB}$. The wffs $p$, $q$ and $r$
each represent independently obtained information. So, when contracting $IB$ by
$p \wedge q$, the resulting infobase should contain weakened versions of the two $(p \wedge q)$-discarded
wffs $p$ and $q$, and should contain the $(p \wedge q)$-retained wff $r$ itself. But
what should the weakened versions of $p$ and $q$ look like? An application of the
coherentist approach on a local level suggests that, in order to minimise the loss
of information, one should add only the minimal models of $\neg(p \wedge q)$ to the models
of both $p$ and $q$, and let the corresponding wffs be the appropriate weakened
versions. The weakened version of $p$ would be logically equivalent to $p \vee (q \wedge r)$
and the weakened version of $q$ would be logically equivalent to $q \vee (p \wedge r)$. On
the other hand, the foundationalist approach, which stresses the independence
of the wffs in $IB$, suggests that the presence of $r$ should have no effect on the
weakened versions of $p$ and $q$. In this view, the wff $p \vee q$ (or any wff logically
equivalent to it) would be a suitable choice for the weakened versions of both $p$
and $q$.
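The coherentist reading of Example 1 can be reproduced with a few lines of code. The sketch below is illustrative only: interpretations are triples of truth values for $(p, q, r)$, the $IB$-number counts the satisfied wffs of the infobase (none of which is a tautology here), and the variable and function names are hypothetical.

```python
from itertools import product

IB = ["p", "q", "r"]                        # the infobase of Example 1
WORLDS = list(product([0, 1], repeat=3))    # triples (p, q, r)
SAT = {
    "p": lambda w: w[0] == 1,
    "q": lambda w: w[1] == 1,
    "r": lambda w: w[2] == 1,
}

def ib_number(w):
    """Number of wffs of IB that hold in the interpretation w."""
    return sum(1 for b in IB if SAT[b](w))

def min_models(candidates):
    """Minimal interpretations in the IB-induced preorder: those with the largest IB-number."""
    best = max(ib_number(w) for w in candidates)
    return {w for w in candidates if ib_number(w) == best}

alpha = lambda w: w[0] == 1 and w[1] == 1                   # p ∧ q
minimal = min_models([w for w in WORLDS if not alpha(w)])   # minimal models of ¬(p ∧ q)

models_IB = {w for w in WORLDS if all(SAT[b](w) for b in IB)}
contracted = models_IB | minimal            # models of the IB-induced theory contraction

discarded = [b for b in IB if not all(SAT[b](w) for w in contracted)]
retained = [b for b in IB if b not in discarded]

# Coherentist weakening: add the minimal models of ¬α to the models of each discarded wff.
weakened = {b: {w for w in WORLDS if SAT[b](w)} | minimal for b in discarded}

print(sorted(minimal), discarded, retained)
```

The model set computed for the weakened `p` coincides with the models of $p \vee (q \wedge r)$, as stated in the example.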

There does not seem to be a definite answer to the question of which one of these
two approaches to infobase change is the “correct” one. They should rather be
seen as opposites on a whole spectrum of possibilities. The coherentist approach
can be described as the case where all the wffs in $IB$ play a role in determining
the weakened versions of the $\alpha$-discarded wffs, while the foundationalist
approach ensures that only the set of $\alpha$-discarded wffs themselves is involved in
the construction of their weakened versions. Given these two opposites, it also
seems perfectly reasonable to allow for any set of wffs in between (i.e., containing
the $\alpha$-discarded wffs and included in $S(IB)$) to be involved in the construction
of the weakened versions of the $\alpha$-discarded wffs.

Fig. 1. A graphical representation of the $IB$-induced faithful total preorder $\preceq_{IB}$,
with $IB = [p, q, r]$. For every $u, v \in U$, $u \preceq_{IB} v$ iff $(u, v)$ is in the reflexive transitive
closure of the relation determined by the arrows. Interpretations are represented
as ordered triples of 0s and 1s, 0 representing falsity and 1 representing
truth. The first digit in a triple represents the truth value of $p$, the second the
truth value of $q$ and the third the truth value of $r$.



Definition 6. Given an infobase $IB$ and a wff $\alpha$, a set $R$ is said to be $(IB, \alpha)$-relevant
iff $IB^{-\alpha} \subseteq R \subseteq S(IB)$.

Our goal is to ensure that, in the process of obtaining the weakened versions
of the $\alpha$-discarded wffs, the effect of the wffs not in the $(IB, \alpha)$-relevant set $R$
is neutralised. To do so, we should not just add the $\preceq_{IB}$-minimal models of
$\neg\alpha$, but also any other models of $\neg\alpha$ that behave exactly like the $\preceq_{IB}$-minimal
models with respect to the wffs in $R$, but that might differ from the $\preceq_{IB}$-minimal
models on the truth value of the wffs in $S(IB) \setminus R$.

Definition 7. For $X \subseteq L$ and $u, v \in U$, $u$ is $X$-equivalent to $v$, written $u \equiv_X v$,
iff for every $x \in X$, $u \in M(x)$ iff $v \in M(x)$.

In general, we obtain the weakened version of every $\alpha$-discarded wff $\beta$ as follows.
We need some appropriate set of interpretations that can be added to the models
of $\beta$ to obtain the set of models of its weakened version. Once we have decided
on an $(IB, \alpha)$-relevant set $R$, we use the set of minimal models of $\neg\alpha$ as our
starting point and then try to expand it so that only elements in $R$ have any
influence, thus neutralising the possible influence of any of the remaining wffs in $IB$.
This is accomplished by including all the models of $\neg\alpha$ that are $R$-equivalent to
some minimal model of $\neg\alpha$.

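Definition 8. Let $R$ be any $(IB, \alpha)$-relevant set. For $u \in Min_{\preceq_{IB}}(\neg\alpha)$, let
$N^R_u(\neg\alpha) = \{v \in M(\neg\alpha) \mid v \equiv_R u\}$ and $N^R_{IB}(\neg\alpha) = \bigcup_{u \in Min_{\preceq_{IB}}(\neg\alpha)} N^R_u(\neg\alpha)$.
We refer to $N^R_{IB}(\neg\alpha)$ as the $(R, \alpha)$-neutralised models of $IB$.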

We take the $(R, \alpha)$-neutralised models as the set of interpretations to be added
to the models of each $\alpha$-discarded wff. We can think of the $(R, \alpha)$-neutralised
models as a set of interpretations in which the influence of the wffs not in $R$
has been removed, but in which the wffs in $R$ have the same impact as on the
minimal models of $\neg\alpha$.
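
As a minimal sketch of Definitions 7 and 8, the Python fragment below computes the $(R, \alpha)$-neutralised models by brute force over truth assignments. It rests on assumptions that are not stated in this section: wffs are represented by their sets of models, and the $IB$-induced preorder is taken to place an interpretation lower the more wffs of $IB$ it satisfies. The names `models`, `min_models`, `r_equivalent` and `neutralised` are illustrative only.

```python
from itertools import product

U = frozenset(product((0, 1), repeat=2))      # interpretations as (p, q)

def models(pred):
    """M(x): the interpretations satisfying the predicate pred."""
    return frozenset(u for u in U if pred(u))

def min_models(ib, target):
    """Minimal elements of target under the ASSUMED IB-induced preorder:
    the more wffs of IB an interpretation satisfies, the lower it sits."""
    def score(u):
        return sum(u in beta for beta in ib)
    best = max((score(u) for u in target), default=0)
    return frozenset(u for u in target if score(u) == best)

def r_equivalent(u, v, r):
    """u and v agree on the truth value of every wff in r (Definition 7)."""
    return all((u in x) == (v in x) for x in r)

def neutralised(ib, r, not_alpha):
    """Models of not-alpha that are R-equivalent to some minimal model
    of not-alpha (Definition 8)."""
    mins = min_models(ib, not_alpha)
    return frozenset(v for v in not_alpha
                     if any(r_equivalent(v, u, r) for u in mins))

p = models(lambda u: u[0] == 1)
q = models(lambda u: u[1] == 1)
ib = [p, q]
not_p = U - p

coherent = neutralised(ib, [p, q], not_p)     # R = S(IB)
foundational = neutralised(ib, [p], not_p)    # R = {p}
```

For $IB = [p, q]$ and $\alpha = p$, the only minimal model of $\neg p$ under this ranking is $(0, 1)$. Choosing $R = \{p\}$ also admits $(0, 0)$, since it agrees with $(0, 1)$ on $p$, whereas $R = S(IB)$ admits nothing further.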

It turns out that there is an elegant way to provide a uniform description 
of infobase contraction. We can describe it as a process in which all the wffs in 
the current infobase are replaced with weaker versions, but where the “weaker” 
version of every a-retained wff turns out to be logically equivalent to the wff 
itself. 

Definition 9. Let $R$ be any $(IB, \alpha)$-relevant set. For every $\beta \in S(IB)$, we
let $N^R_\beta(\neg\alpha) =$ [...]. We refer to $N^R_\beta(\neg\alpha)$ as the
$(R, \alpha, \beta)$-neutralised models of $IB$.

The next proposition shows that an $\alpha$-retained wff $\beta$ has no $(R, \alpha, \beta)$-neutralised
models, and that, for an $\alpha$-discarded wff $\beta$, adding the $(R, \alpha, \beta)$-neutralised
models to the models of $\beta$ has the same effect as adding the $(R, \alpha)$-neutralised
models.

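Proposition 1. Let $R$ be any $(IB, \alpha)$-relevant set.

1. If $\beta \in S(IB) \setminus IB^{-\alpha}$ then $N^R_\beta(\neg\alpha) = \emptyset$.

2. If $\beta \in IB^{-\alpha}$ then $M(\beta) \cup N^R_\beta(\neg\alpha) = M(\beta) \cup N^R_{IB}(\neg\alpha)$.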

We are now almost in a position to define basic infobase contraction. 
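
Definition 10. A function $rs : \mathcal{IB} \times L \to \wp L$ is a relevance selection function
iff

1. $IB^{-\alpha} \subseteq rs(IB, \alpha) \subseteq IB$,

2. if $\alpha \equiv \beta$ then $rs(IB, \alpha) = rs(IB, \beta)$, and

3. if $IB \approx IC$ then $rs(IB, \alpha) \approx rs(IC, \alpha)$.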

Intuitively, a relevance selection function indicates which of the wffs in $IB$ should
play a role in determining the weakened versions during a contraction. Observe
that $rs(IB, \alpha)$ is $(IB, \alpha)$-relevant.

Definition 11. 1. An infobase change operation $\ominus$ is a basic infobase
contraction iff there is a relevance selection function $rs$ such that, for every
$IB \in \mathcal{IB}$ and every $\alpha \in L$, $IB \ominus \alpha$ is obtained by replacing every wff $\beta$ in [...]

2. For every $IB \in \mathcal{IB}$, an infobase $IB$-change operation $\ominus_{IB}$ is a basic infobase
$IB$-contraction iff it can be obtained from a basic infobase contraction $\ominus$ by
fixing the infobase $IB$. That is, iff $IB \ominus_{IB} \alpha = IB \ominus \alpha$ for every $\alpha \in L$.

It can be verified that for the infobase $IB = [p, q]$, for example, there is at
least one basic infobase contraction $\ominus$ such that $IB \ominus p \approx [\top, q]$ and
$IB \ominus (p \land q) \approx [p \lor q, p \lor q]$, and at least one basic infobase contraction $\ominus'$ such that
$IB \ominus' p \approx [p \lor q, q]$ and $IB \ominus' (p \land q) \approx [p \lor q, p \lor q]$.
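
The contractions of $IB = [p, q]$ quoted above can be reproduced with a short brute-force sketch. It is an illustration under stated assumptions rather than the construction of the paper: wffs are identified with their model sets, the $IB$-induced preorder is assumed to rank interpretations by how many wffs of $IB$ they satisfy, the $\alpha$-discarded wffs are assumed to be those falsified by some minimal model of $\neg\alpha$, and a discarded wff is weakened by adding the $(R, \alpha)$-neutralised models to its models. All function names are hypothetical.

```python
from itertools import product

U = frozenset(product((0, 1), repeat=2))      # interpretations as (p, q)

def models(pred):
    """M(x): the interpretations satisfying the predicate pred."""
    return frozenset(u for u in U if pred(u))

def min_models(ib, target):
    """Minimal elements of target under the ASSUMED IB-induced preorder."""
    def score(u):
        return sum(u in beta for beta in ib)
    best = max((score(u) for u in target), default=0)
    return frozenset(u for u in target if score(u) == best)

def discarded(ib, alpha):
    """ASSUMED reading of the alpha-discarded wffs: those falsified by
    at least one minimal model of not-alpha."""
    mins = min_models(ib, U - alpha)
    return [beta for beta in ib if not mins <= beta]

def contract(ib, alpha, rs):
    """Add the (R, alpha)-neutralised models to every discarded wff."""
    not_alpha = U - alpha
    mins = min_models(ib, not_alpha)
    r = rs(ib, alpha)
    neutralised = frozenset(
        v for v in not_alpha
        if any(all((v in x) == (u in x) for x in r) for u in mins))
    gone = discarded(ib, alpha)
    return [beta | neutralised if beta in gone else beta for beta in ib]

def foundationalist(ib, alpha):
    return discarded(ib, alpha)               # R = the discarded wffs only

def coherentist(ib, alpha):
    return list(ib)                           # R = every wff in IB

p = models(lambda u: u[0] == 1)
q = models(lambda u: u[1] == 1)

assert contract([p, q], p, foundationalist) == [U, q]              # [T, q]
assert contract([p, q], p, coherentist) == [p | q, q]              # [p v q, q]
assert contract([p, q], p & q, foundationalist) == [p | q, p | q]
assert contract([p, q], p & q, coherentist) == [p | q, p | q]
```

Under these assumptions the two selection functions differ on the contraction by $p$ but coincide on the contraction by $p \land q$, because in the latter case both $p$ and $q$ are discarded and every relevant set equals $S(IB)$.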

Basic infobase revision is defined by an appeal to the following infobase
analogue of the Levi Identity:

(Def $\circledast$ from $\ominus$)  $IB \circledast \alpha = (IB \ominus \neg\alpha) \bullet \alpha$

Definition 12. An infobase change operation is a basic infobase revision iff it
is defined in terms of a basic infobase contraction using (Def $\circledast$ from $\ominus$).



4 Properties of Basic Infobase Change

Given the intuition associated with infobase change, it is to be expected that the
$IB$-induced theory contraction is the theory contraction associated with every
basic infobase $IB$-contraction. The next result shows that this is indeed the case,
and that a similar result holds for basic infobase revision.

Definition 13. A theory revision $*$ is associated with an infobase $IB$-revision
$\circledast$ iff $Cn(S(IB)) * \alpha = Cn(S(IB \circledast \alpha))$ for every $\alpha \in L$.

(Def $*_{IB}$ from $IB$)  $Cn(S(IB)) *_{IB} \alpha = Th(Min_{\preceq_{IB}}(\alpha))$

Definition 14. The theory revision $*_{IB}$ defined in terms of an infobase $IB$
using (Def $*_{IB}$ from $IB$) is referred to as the $IB$-induced theory revision.

For the remainder of this section we assume $\ominus$ to be a basic infobase contraction,
and we let $\circledast$ be the basic infobase revision defined in terms of $\ominus$ using
(Def $\circledast$ from $\ominus$).

Proposition 2. $Cn(S(IB)) -_{IB} \alpha = Cn(S(IB \ominus \alpha))$ and
$Cn(S(IB \circledast \alpha)) = Cn(S(IB)) *_{IB} \alpha$.

The next result, which follows straightforwardly from the construction, shows
that the syntactic form of the wffs in an infobase, as well as the form of the wff
with which to contract or revise, are irrelevant.

Proposition 3. If $IB \approx IC$ and $\alpha \equiv \beta$ then $IB \circledast \alpha \approx IC \circledast \beta$ and
$IB \ominus \alpha \approx IC \ominus \beta$.

We regard this property of infobase change as an advantage over some of the
other, more syntactically-oriented, approaches to base change.

In the context of infobase change, reason maintenance [4] amounts to ensuring
that the contraction of $IB$ by a wff $\alpha$ in $IB$ results in the removal of all
the wffs that are dependent on $\alpha$ for being in $Cn(S(IB))$. Fuhrmann [5] has
given a precise meaning to the idea of a wff being dependent on $\alpha$ (for being in
$Cn(S(IB))$).

Definition 15. A wff $\beta \in L$ is $IB$-dependent on $\alpha$ iff $\alpha \in S(IB)$ and
$\beta \in Cn(S(IB))$, but $\beta \notin Cn(S(IB) \setminus \{\alpha\})$.
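
Definition 15 can be checked mechanically on small examples. The sketch below is illustrative only: it identifies wffs with their model sets over two atoms, so membership of $\alpha$ in $S(IB)$ is tested up to logical equivalence, which is a simplification of the syntactic definition. The helper names `entails` and `depends_on` are hypothetical.

```python
from itertools import product

U = frozenset(product((0, 1), repeat=2))      # interpretations as (p, q)

def models(pred):
    """M(x): the interpretations satisfying the predicate pred."""
    return frozenset(u for u in U if pred(u))

def entails(wffs, beta):
    """beta is in Cn(wffs): every common model of wffs is a model of beta."""
    common = U
    for x in wffs:
        common &= x
    return common <= beta

def depends_on(ib, beta, alpha):
    """Definition 15, with wffs identified with their model sets."""
    s = set(ib)
    return alpha in s and entails(s, beta) and not entails(s - {alpha}, beta)

p = models(lambda u: u[0] == 1)
q = models(lambda u: u[1] == 1)
ib = [p, q]

assert depends_on(ib, p & q, p)       # p and q is lost once p is removed
assert not depends_on(ib, p | q, p)   # p or q still follows from q alone
```

With $IB = [p, q]$, the wff $p \land q$ is $IB$-dependent on $p$, while $p \lor q$ is not, since it remains derivable from $q$.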

The next result shows that basic infobase change incorporates reason
maintenance.

Proposition 4. If $\beta$ is $IB$-dependent on $\alpha$ then $\beta \notin Cn(S(IB \ominus \alpha))$ and
$\beta \notin Cn(S(IB \circledast \neg\alpha))$.

A related question is whether, if $\alpha$ is in $IB$ and $\alpha \notin Cn(S(IB \ominus \gamma))$, it will
be the case that $\beta \notin Cn(S(IB \ominus \gamma))$ for every $\beta$ that is $IB$-dependent on $\alpha$.
This property is known as Fuhrmann's filtering condition [5]. It is easy to see
that basic infobase contraction can violate the filtering condition. But, given the
intuition associated with infobases, the filtering condition is clearly too strong
a requirement to impose. For it requires that for any infobase contraction $\ominus$,
$Cn(S(IB \ominus \gamma)) = Cn(\top)$ for any singleton infobase $IB$ and any $\gamma \in Cn(S(IB))$
(where $\nvDash \gamma$), thus leaving no room for weakening the wff in $IB$ to anything but
a logically valid wff.

Finally, it is also possible to provide a result for infobase change which is 
reminiscent of the Harper Identity. 

Proposition 5. Let $\circledast$ be a basic infobase revision, and let $\ominus$ be an infobase
change operation such that IB ©aw 'iB ® -la. Then $\ominus$ is a basic infobase
contraction.

5 Iterated Infobase Change 

Although an infobase $IB$ induces the unique theory contraction $-_{IB}$, infobases
do not contain enough information to determine a basic infobase contraction or
revision. To do that, we also need a relevance selection function $rs$. Once $rs$ is
fixed, though, we are dealing with a specific basic infobase contraction and
revision, which allows for the possibility of iterated infobase change. In this section
we investigate whether iterated infobase change measures up to the postulates
supplied by Darwiche et al. [2,3] and Lehmann [10]. To do so, we have to work
on the level of epistemic states.^ Following Darwiche and Pearl we assume that
every epistemic state $\Phi$ has associated with it a belief set $K(\Phi)$ and a
$K(\Phi)$-faithful total preorder $\preceq_\Phi$. To bring infobase change into this framework, we
assume that it is possible to extract a unique infobase $IB_\Phi$ from every epistemic
state $\Phi$. This implies that $K(\Phi) = Cn(S(IB_\Phi))$ and that $\preceq_\Phi$ is identical to the
$IB_\Phi$-induced faithful total preorder $\preceq_{IB_\Phi}$. Note, however, that infobases contain
more information than such ordered pairs. For example, letting $IB = [p, q]$ and
IC = [p q], it is easy to check that $Cn(S(IB)) = Cn(S(IC))$, and that
$\preceq_{IB}$ and $\preceq_{IC}$ are identical. Furthermore, it is also easy to establish that every
ordered pair of this kind can be obtained from some infobase. (See p. 26 of [11].)
More important, perhaps, is the fact that the extra information contained in
infobases plays an important role in the process of infobase change, as the next
example shows.

^ The use of epistemic states has been advocated by a number of authors, including
Darwiche and Pearl [3] and Lehmann [10].



Example 2. Let $\ominus$ be the basic infobase contraction obtained from the relevance
selection function $rs$, where $rs(IB, \alpha) = IB^{-\alpha}$ for every $IB \in \mathcal{IB}$ and every
$\alpha \in L$, and let $\circledast$ be the basic infobase revision defined in terms of $\ominus$ using
(Def $\circledast$ from $\ominus$). Let $IB = [p, q]$ and let $IC = [p \land q, p, q, p \lor q, p \to q, q \to p]$.
Clearly $Cn(S(IB)) = Cn(S(IC))$ and it is easily established that $\preceq_{IB}$ and $\preceq_{IC}$
are identical. Yet, it can be verified that $IB \circledast (p \land \neg q) \approx [p, \top, p \land \neg q]$ and that
$IC \circledast (p \land \neg q) \approx [p, p, p \lor q, p \lor q, \top, q \to p, p \land \neg q]$. It can then also be verified
that $IB \circledast (p \land \neg q)$ and $IC \circledast (p \land \neg q)$ induce different faithful total preorders.



Having established that epistemic states need to have a richer structure than
ordered pairs of the form $(K(\Phi), \preceq_\Phi)$, we now turn to the definition of revision
on epistemic states in terms of basic infobase revision.

(Def $*$ from $\circledast$)  $K(\Phi * \alpha) = Cn(S(IB_\Phi \circledast \alpha))$

Definition 16. We refer to the revision on epistemic states defined in terms of
a basic infobase revision $\circledast$ using (Def $*$ from $\circledast$) as the $\circledast$-associated revision
on epistemic states.

5.1 DP-Revision 

In two influential papers, Darwiche et al. [2,3] argue that belief change ought to 
be conducted on the level of epistemic states and, in addition to modifying the 
AGM postulates appropriately, propose the following additional four postulates 
for iterated revision: 

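(DP1) If $\alpha \vDash \beta$ then $K((\Phi * \beta) * \alpha) = K(\Phi * \alpha)$

(DP2) If $\alpha \vDash \neg\beta$ then $K((\Phi * \beta) * \alpha) = K(\Phi * \alpha)$

(DP3) If $\beta \in K(\Phi * \alpha)$ then $\beta \in K((\Phi * \beta) * \alpha)$

(DP4) If $\neg\beta \notin K(\Phi * \alpha)$ then $\neg\beta \notin K((\Phi * \beta) * \alpha)$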

When placed in this framework, basic infobase revision yields favourable results.
The revisions on epistemic states associated with basic infobase revisions satisfy
all but the first one of the four DP-postulates. This is in marked contrast with
the version of infobase revision described in [12], which does not satisfy any of
these postulates.

Proposition 6. Let $\circledast$ be a basic infobase revision, and let $*$ be the $\circledast$-associated
revision on epistemic states. Then $*$ satisfies (DP2)-(DP4), but does not
necessarily satisfy (DP1).

It is our contention that the violation of (DP1) by basic infobase revision is an
indication that this postulate is perhaps too restrictive to accommodate a wide
range of rational forms of revision. Below we give a realistic example in support
of this claim.^

Example 3. I have a circuit containing two components: an adder and a
multiplier. I have made three independent observations about these components: (1)
The adder is working, (2) the multiplier is working, and (3) if the adder doesn’t 
work then the multiplier also doesn’t work. Another observation now indicates 
that at least one of the two components is not working. In trying to incorporate 
this new information, we have to discard (or weaken) at least one of the first two 
observations. Moreover, we cannot retain both observations (2) and (3), for they 
imply observation (1). So it seems reasonable to retain the belief that the adder 
is working and the belief that a broken adder implies a broken multiplier. To- 
gether with the new information that at least one of the components is broken, 
it then follows that it is the multiplier that is broken. 

This line of reasoning can be formalised by using the two atoms $a$ (indicating
that the adder is working) and $m$ (indicating that the multiplier is working). My
initial infobase then looks like this: $IB = [a, m, \neg a \to \neg m]$. It is easily verified
that for any basic infobase revision $\circledast$, $Cn(S(IB \circledast \neg(a \land m))) = Cn(a \land \neg m)$,
which means that $m$ should be discarded and that $a$ and $\neg a \to \neg m$ should be
retained. But what should the weakened version of the discarded wff $m$ look
like? One reasonable option is to discard it completely, or, what amounts to the
same thing, to weaken it so that it becomes logically valid. Formally, this can
be accomplished as follows. Let $rs$ be a relevance selection function such that
$rs(IB, a \land m) = IB^{-(a \land m)} = \{m\}$. Since $IB^{-(a \land m)}$ is $(IB, a \land m)$-relevant, there
is such an $rs$. Now consider the basic infobase contraction $\ominus$ which is obtained
using $rs$. It can be verified that $IB \ominus \neg\neg(a \land m) \approx IB \ominus (a \land m) \approx [a, \top, \neg a \to \neg m]$
and therefore $IB \circledast \neg(a \land m) \approx [a, \top, \neg a \to \neg m, \neg(a \land m)]$, where $\circledast$ is the basic
infobase revision defined in terms of $\ominus$ using (Def $\circledast$ from $\ominus$).

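To see that the revision $*$ defined in terms of $\circledast$ using (Def $*$ from $\circledast$) violates
(DP1), observe that $Cn(S(IB \circledast \neg a)) = Cn(\neg a)$, but that
$Cn(S((IB \circledast \neg(a \land m)) \circledast \neg a)) = Cn(\neg a \land \neg m)$. So
$K((\Phi * \neg(a \land m)) * \neg a) \neq K(\Phi * \neg a)$ even
though $\neg a \vDash \neg(a \land m)$, where $\Phi$ is an epistemic state such that $IB_\Phi = IB$.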
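
The figures in Example 3 can be checked by brute force. The sketch below makes several assumptions beyond the text: wffs are identified with their model sets over the atoms $a$ and $m$, the $IB$-induced preorder ranks interpretations by the number of wffs of $IB$ they satisfy, and the relevance selection function is taken to return exactly the discarded wffs for every input (the example only fixes its value at $a \land m$). The function names are hypothetical.

```python
from itertools import product

U = frozenset(product((0, 1), repeat=2))      # interpretations as (a, m)

def models(pred):
    """M(x): the interpretations satisfying the predicate pred."""
    return frozenset(u for u in U if pred(u))

def min_models(ib, target):
    """Minimal elements of target under the ASSUMED IB-induced preorder."""
    def score(u):
        return sum(u in beta for beta in ib)
    best = max((score(u) for u in target), default=0)
    return frozenset(u for u in target if score(u) == best)

def contract(ib, alpha):
    """Contraction with R = the discarded wffs (an assumption of this sketch)."""
    not_alpha = U - alpha
    mins = min_models(ib, not_alpha)
    gone = [beta for beta in ib if not mins <= beta]
    neutralised = frozenset(
        v for v in not_alpha
        if any(all((v in x) == (u in x) for x in gone) for u in mins))
    return [beta | neutralised if beta in gone else beta for beta in ib]

def revise(ib, alpha):
    """Levi-style revision: contract by not-alpha, then append alpha."""
    return contract(ib, U - alpha) + [alpha]

def belief(ib):
    """Models of Cn(S(IB)): the interpretations satisfying every wff in IB."""
    result = U
    for beta in ib:
        result &= beta
    return result

a = models(lambda u: u[0] == 1)
m = models(lambda u: u[1] == 1)
imp = models(lambda u: u[0] == 1 or u[1] == 0)   # not-a -> not-m
ib = [a, m, imp]
not_a, not_m, not_am = U - a, U - m, U - (a & m)

first = revise(ib, not_am)
assert first == [a, U, imp, not_am]              # m is weakened to a valid wff
once = belief(revise(ib, not_a))                 # models of K(Phi * not-a)
twice = belief(revise(first, not_a))             # after revising by not-(a and m) first
assert once == not_a
assert twice == not_a & not_m
```

The two belief sets differ even though $\neg a \vDash \neg(a \land m)$, which is the violation of (DP1) described in the example.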

There is a form of basic infobase revision which always satisfies (DP1). It
corresponds to the coherentist approach to infobase change.

Definition 17. A coherentist basic infobase revision $\circledast$ is a basic infobase
revision such that $rs(IB, \alpha) = IB$ for every $\alpha \in L$, for the relevance selection
function $rs$ from which $\circledast$ is obtained.

^ The example was inspired by a similar example of Darwiche et al. [3].

Proposition 7. Let $\circledast$ be the coherentist basic infobase revision and let $*$ be the
revision on epistemic states defined in terms of $\circledast$ using (Def $*$ from $\circledast$). Then
$*$ satisfies (DP1).

5.2 L-Revision

Lehmann [10] considers iterated belief revision in the context of finite sequences
of revisions. He extends the notion of a revision $*$ on epistemic states to a revision
by a finite sequence of wffs. $\Phi * \sigma$ then refers to the iterated revision of $\Phi$ by the
wffs in $\sigma$, and if $\sigma$ is the empty sequence, $\Phi * \sigma$ is just the epistemic state $\Phi$. A
wff $\alpha$ is identified with a sequence of length one. Considering only sequences of
satisfiable wffs, Lehmann proposes the following postulates for iterated revision.

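(L*1) $K(\Phi) = Cn(K(\Phi))$

(L*2) $\alpha \in K(\Phi * \alpha)$

(L*3) $K(\Phi * \alpha) \subseteq K(\Phi) + \alpha$

(L*4) If $\alpha \in K(\Phi)$ then $K(\Phi * \sigma) = K(\Phi * (\alpha \bullet \sigma))$

(L*5) If $\alpha \vDash \beta$ then $K(\Phi * (\beta \bullet \alpha \bullet \sigma)) = K(\Phi * (\alpha \bullet \sigma))$

(L*6) $K(\Phi) \neq Cn(\bot)$

(L*7) $K(\Phi * (\neg\alpha \bullet \alpha)) \subseteq K(\Phi) + \alpha$

(L*8) If $\neg\beta \notin K(\Phi * \alpha)$ then $K(\Phi * (\alpha \bullet \beta \bullet \sigma)) = K(\Phi * (\alpha \bullet \alpha \land \beta \bullet \sigma))$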

It is easily verified that the revision $*$ on epistemic states obtained in terms of
a basic infobase revision $\circledast$ using (Def $*$ from $\circledast$) satisfies (L*1), (L*2), (L*3)
and (L*6). It can also be verified that (L*7) is a weakened version of (DP2)
and it thus follows from Proposition 6 that $*$ also satisfies (L*7). It does not
necessarily satisfy (L*4), (L*5) and (L*8), though, as the following example
shows.

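Example 4. Let $\circledast$ be the basic infobase revision obtained from the relevance
selection function $rs$ for which $rs(IB, \alpha) = IB^{-\alpha}$ for every $IB \in \mathcal{IB}$ and every
$\alpha \in L$.

1. Let $IB = [p \land \neg q, p \lor q]$. Clearly $IB \circledast p \approx [p \land \neg q, p \lor q, p]$. It can be verified
that $Cn(S((IB \circledast p) \circledast q)) = Cn(p \land q)$, but that $Cn(S(IB \circledast q)) = Cn(q)$.
Taking $p$ as $\alpha$ and $q$ as the sequence of wffs $\sigma$, this is a violation of (L*4).

2. Let $IB = [p \leftrightarrow q, p \lor \neg q, \neg p \lor \neg q, \neg q]$. It can be verified that $IB \circledast q \approx
[p \leftrightarrow q, p \lor \neg q, p \lor \neg q, q]$, $IB \circledast p \lor q \approx [p \lor \neg q, \neg p \lor \neg q, \neg q, p \lor q]$,
$(IB \circledast p \lor q) \circledast q \approx [p \lor q, q]$, $Cn(S(((IB \circledast p \lor q) \circledast q) \circledast \neg q)) = Cn(p \land \neg q)$, and
$Cn(S((IB \circledast q) \circledast \neg q)) = Cn(\neg p \land \neg q)$. Taking $p \lor q$ as $\beta$, $q$ as $\alpha$, and $\neg q$ as
the sequence of wffs $\sigma$, this constitutes a violation of (L*5).

3. Let $IB = [p \lor q, p \lor \neg q]$. Clearly $IB \circledast p \approx [p \lor q, p \lor \neg q, p]$, $(IB \circledast p) \circledast q =
[p \lor q, p \lor \neg q, p, q]$, and $(IB \circledast p) \circledast p \land q = [p \lor q, p \lor \neg q, p, p \land q]$. It can be
verified that $Cn(S(((IB \circledast p) \circledast q) \circledast \neg p)) = Cn(\neg p \land q)$, and
$Cn(S(((IB \circledast p) \circledast p \land q) \circledast \neg p)) = Cn(\neg p)$. With $p$ as $\alpha$, $q$ as $\beta$, and $\neg p$ as the sequence of
wffs $\sigma$, it follows that (L*8) is violated.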
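
The first item of Example 4 can be replayed with a small brute-force sketch. As before, this is an illustration under assumptions that the section does not spell out: wffs are identified with their model sets, the $IB$-induced preorder counts satisfied wffs, and the $\alpha$-discarded wffs are those falsified by some minimal model of $\neg\alpha$. The relevance selection function is the one of the example, $rs(IB, \alpha) = IB^{-\alpha}$. Function names are hypothetical.

```python
from itertools import product

U = frozenset(product((0, 1), repeat=2))      # interpretations as (p, q)

def models(pred):
    """M(x): the interpretations satisfying the predicate pred."""
    return frozenset(u for u in U if pred(u))

def min_models(ib, target):
    """Minimal elements of target under the ASSUMED IB-induced preorder."""
    def score(u):
        return sum(u in beta for beta in ib)
    best = max((score(u) for u in target), default=0)
    return frozenset(u for u in target if score(u) == best)

def contract(ib, alpha):
    """Contraction with rs(IB, alpha) = the alpha-discarded wffs."""
    not_alpha = U - alpha
    mins = min_models(ib, not_alpha)
    gone = [beta for beta in ib if not mins <= beta]
    neutralised = frozenset(
        v for v in not_alpha
        if any(all((v in x) == (u in x) for x in gone) for u in mins))
    return [beta | neutralised if beta in gone else beta for beta in ib]

def revise(ib, alpha):
    """Levi-style revision: contract by not-alpha, then append alpha."""
    return contract(ib, U - alpha) + [alpha]

def belief(ib):
    """Models of Cn(S(IB)): the interpretations satisfying every wff in IB."""
    result = U
    for beta in ib:
        result &= beta
    return result

p = models(lambda u: u[0] == 1)
q = models(lambda u: u[1] == 1)
ib = [p - q, p | q]                           # [p and not q, p or q]

assert belief(ib) <= p                        # p is already believed
assert revise(ib, p) == [p - q, p | q, p]     # nothing is discarded
lhs = belief(revise(revise(ib, p), q))        # models of K(Phi * (p . q))
rhs = belief(revise(ib, q))                   # models of K(Phi * q)
assert lhs == p & q and rhs == q
```

Revising by the already believed $p$ before revising by $q$ changes the outcome, which is what (L*4) forbids.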

An examination of this example suggests that, unlike the DP-postulates, (L*4),
(L*5) and (L*8) are fundamentally incompatible with basic infobase revision.

6 Conclusion

We have extended the initial infobase proposal of Meyer et al. [12], but much
still needs to be done. Two obvious extensions that still need to be developed
have already been hinted at by Meyer et al. [12]. Both involve the introduction
of orderings of epistemic relevance in the spirit of Nebel [13,14,15]. And finally, it
remains to be seen how basic infobase change fits into a more general theory of
base change.

Abstract. In this paper, we identify that failure to cater for the various
forms of heterogeneity is one of the major drawbacks of the previous
research on multi-agent belief revision (MABR). Three major categories of
heterogeneity, namely social, semantic and syntactic heterogeneity, are
clarified. Several issues posed by such heterogeneities are addressed in
the context of BR. The use of ontology is proposed as a powerful tool
to tackle the heterogeneity issues so as to achieve the necessary reliable
communication and system interoperability required by MABR. The
question of what kind of ontology would be suitable to support MABR
in a heterogeneous setting is answered in Part I. In its sequel, Part II, a
general framework for MABR is presented based on a shared knowledge
structure which serves as the theoretical basis for ontology design.



1 Introduction 

Belief Revision (BR) is a ubiquitous process underlying many forms of intelligent
behaviour [33]. An essential skill an autonomous agent should possess is the
ability to revise its beliefs in a coherent and rational fashion when it receives new
information. Most BR research, however, has been developed with a single agent
in mind, i.e., only one problem solver using the BR service. SATEN^, a web-based
BR system which incorporates several revision strategies, is a good example of
a single BR agent. Although it is able to clone its current state, the clones and
their ancestors cannot communicate. In other words, they act independently as
single agents without awareness of each other's existence.

Multi-Agent Systems (MASs) are distributed computing systems composed
of a number of interacting computational entities (possibly from various
vendors). One important characteristic distinguishing MASs from traditional
distributed systems is that both the MAS and its components (agents) are
intelligent [29]. As MASs become increasingly attractive for solving larger and
more complex problems, the need for adequate BR technology in the MAS
paradigm arises. Only a few BR frameworks are known that claim to be suited
for MAS applications. In section 2, a BR hierarchy is suggested to clarify the
terminologies adopted in recent research on Multi-Agent Belief Revision (MABR).
Section 3 reviews the evolution of various frameworks for MABR.

^ http://infosystems.newcastle.edu.au/webworld/saten

N. Foo (Ed.): AI’99, LNAI 1747, pp. 168-179, 1999.

(c) Springer-Verlag Berlin Heidelberg 1999

An enormous number of forms of heterogeneity exist in MASs because of
the flexibility and complexity of agent interaction and organisation; in addition,
agents might be developed by different vendors [9] [27]. Current frameworks do
not sufficiently support BR which requires and is affected by interagent
communication and interoperation in a heterogeneous environment. This functionality
is highly desirable in many domains, such as electronic commerce, group decision
making and cooperative information systems. In section 4, we describe different
forms of heterogeneity and emphasize the issues they raise in BR systems.

The development of ontologies is becoming widely accepted as a powerful
methodology for bridging the gap between legacy systems, to enable
communication and interoperability within a heterogeneous system [17]. In section 5, basic
concepts of ontology are briefly introduced and the types of ontology needed for
developing a general MABR system are identified.

The paper concludes in section 6 by stating the current and future research 
problems to be solved so as to develop sound ontology support for the construc- 
tion of a MABR test bed based on the single agent implementation SATEN. 

2 BR in MASs - Concepts and Terminologies 

A variety of notations have been adopted by researchers investigating BR of 
MASs. A good understanding of the relationships between these approaches is 
essential before carrying out any further research. To clarify the terminologies for 
BR of MASs, let us revisit the definition of “agent”. Generally, an agent implies 
a problem solving entity that both perceives and acts upon the environment in 
which it is situated, applying its individual knowledge, skills and other resources 
to accomplish high-level goals. By employing various algorithms and processes, 
agents are capable of taking various actions to achieve their individual goals
or interacting with other agents to achieve mutual goals. According to whether
BR is involved in individual goals or mutual goals, previous research efforts in
BR of MAS can be classified into two categories, i.e., BR using information from
Multiple Sources (MSBR) and MABR.

On one hand, BR could be considered as part of the agent’s skills to maintain
the consistency of its own epistemic state. In this case, an individual BR process
is carried out in a multi-agent environment, where the new information may
come from multiple sources and may conflict. BR in this sense is called MSBR
by Dragoni et al. [3][7][4]. Cantwell [2] tries to resolve conflicting information by
ordering the information sources on the basis of their trustworthiness. In the terms
of MSBR, this could serve as a rational way of generating the credibility of new
information from the reliability of its source. Benferhat et al. [1] investigate
revision of information from multiple sources in the face of uncertainty as data fusion,
using possibilistic logic; Liberatore and Schaerf [23] treat the MSBR process as
intelligent merging of knowledge bases, which they call Arbitration.

On the other hand, BR could also be used to achieve a society’s or team’s
mutual belief goals (e.g. reaching consensus before carrying out plans). In this
setting, more than one agent takes part in the process. In order to pursue the
mutual goal, the agents involved need to communicate, cooperate, coordinate and
negotiate with one another. A MABR system is a MAS whose mutual goal
involves BR. Since a MAS is actually an intelligent distributed system, an
alternative name for MABR could be intelligent Distributed Belief Revision (DBR).
MABR is the terminology adopted by Kfir-dahav and Tennenholtz [22]. Dragoni
et al. prefer DBR [6] based on the comparison with Distributed Truth
Maintenance (DTM). Van der Meyden’s semantical theory of BR in synchronous MAS,
namely Mutual Belief Revision, also falls into the same category as MABR.

MSBR studies individual agent revision behaviours, i.e., how an agent revises
when it receives information from multiple agents towards whom it has social opinions.
MABR investigates the overall BR behaviour of agent teams or a society. MSBR
is one of the essential components of MABR.

The AGM paradigm [12] has been widely accepted as a standard framework
for BR. But it is only capable of prescribing revision behaviours of a single agent.
The BR process is more complex in the multiple-agent case. Besides the Principle
of Minimal Change, there exist other requisites due to the sophisticated agent
interactions. Therefore, the AGM framework is not rich enough to prescribe
a satisfactory revision operator for MABR. In this paper and its sequel^, we
develop a general framework based on ontology to capture the necessary
heterogeneous properties so as to enable the sophisticated agent interactions.

As a result, BR can be thought of in a narrow sense, which encompasses
all previous work in AGM. It can also be considered in a wider sense, taking into
account MASs, from the viewpoint of an agent and an agent society. An agent
is capable of carrying out Individual Belief Revision (IBR), while an agent
society or team is capable of MABR. IBR in a single agent environment (Single
Belief Revision, SBR) could be achieved using classical BR satisfying the AGM
postulates. IBR in a multiple agent environment is MSBR, i.e., a single agent will
have to process information coming from more than one source. After obtaining
the credibility of the new information by evaluating the multiple sources
using some technique (e.g. [2][4]), MSBR reduces to SBR. We can classify the types
of BR using the hierarchy in Fig. 1.



Belief Revision
    Individual Belief Revision (IBR)
        Belief Revision in Single Agent Environment (SBR)
        Individual Belief Revision in MASs: Multiple-Source Belief Revision (MSBR)
    Multi-Agent Belief Revision (MABR) (Intelligent Distributed Belief Revision)

Fig. 1. Belief Revision Hierarchy



^ A Framework for Multi-Agent Belief Revision (Part II: Shared Knowledge Structure)

3 Evolution of MABR Frameworks - A Review

3.1 Mutual Belief Revision of Van der Meyden 

Mutual Belief Revision is a nested process during which an agent must revise 
not only its own beliefs about the world, but also its beliefs about other agents’ 
beliefs about the world and moreover about other agents’ beliefs about its own 
beliefs, and so on. A multi-agent version of perfect introspection logic K45n is 
employed to model the so-called mutual belief^. 

The theory of mutual belief revision ends up with a unique operator which
satisfies the following four assumptions: [I] Agents have perfect introspection^,
but may be inconsistent; [II] Agents revise their beliefs synchronously, in
response to an event e whose occurrence is common knowledge; [III] The world
is static; [IV] Each agent’s revision method is common knowledge. III simplifies
the problem so that there is no need to consider the temporal factor. But the
other three assumptions limit the proposed theory’s capability of handling a
heterogeneous multi-agent environment. The inability to model incompetent or
stupid agents is the major drawback associated with the perfect introspection
assumption I. II confines this theory to the simple environment of broadcasting
and synchronizing. Actually, IV and II together assume that each agent’s revision
process is common knowledge, that is, transparent to other agents.

In the given “scientist in conference” example [26], all the agents work
faithfully, broadcasting what they know and what they do not know. Consider a
distributed knowledge base system, where each agent represents a local Knowledge
Base (KB). Based on the above four assumptions, the agent should store its own
knowledge about the world. Meanwhile, it has full access to all the remote
KBs of other agents. Therefore, every agent knows what the others know and
what they do not know. On the receipt of new information, the agent first
revises its own beliefs; then, because the revision method is common knowledge,
it can readily and successfully predict what all the other agents will do with
their beliefs. In fact, mutual belief revision can be thought of as a set of single
BR processes carried out uniformly in a parallel but decentralized manner.



3.2 MABR of Kfir-dahav and Tennenholtz 

In contrast to Van der Meyden, Kfir-dahav and Tennenholtz [22] initiate research
on MABR in the context of heterogeneous systems. The Private Domains ($PD_i$)
and the Shared Domain ($SD$) of the agent knowledge base are defined in order
to capture a general setting where each agent has private beliefs as well as beliefs
shared with other agents. Under such a knowledge structure, each agent may have
its own perspective of the world but needs to coordinate (i.e., agree on) its beliefs
on shared elements. The shared domain also defines the communication language
for the agents.

^ Roughly, a set of agents mutually believe $\varphi$ iff each of them believes $\varphi$ and each of
them believes that each of them believes $\varphi$, and so on, ad infinitum [10].

^ Perfect introspection assumes a considerable degree of self-knowledge; [11] has detail.

172 Wei Liu and Mary-Anne Williams



One important question in this scenario is how the knowledge in $PD_i$ and $SD$
is to be managed. By definition [22], $SD$ is just the intersection of the
agents’ KBs. Although it is not explicitly stated, the authors assume that agent $i$
is aware of (knows) its own knowledge in $PD_i$ and the existence of $SD$. Since
an agent does not know other agents’ PD, what happens if after several se- 
quences of revision, the intersection of PDiS is not empty? For example, con- 
sider the situation in which two agents(Ai, A2) engage in revision, stands 
for the KB of Ai, i=l,2. <?i = {7, ^2 = Then according to 

the definition, SD = {<f>};PDi = PD 2 = {a, If A 2 receives a piece 

of new information, say, then by Modus Ponens, tp should be in PD 2 - 

Thus, $PD_1 \cap PD_2 = \{\psi\} \neq \emptyset$. $A_2$ will only expand $PD_2$ but not $SD$, provided
that nobody tells it that $\psi$ is also believed by $A_1$. The authors do not say
what to do with this $\psi$. If $\psi \in SD$, it seems to be implicitly assumed that there
exists a super agent who knows all the agents’ private knowledge and the
society’s shared knowledge, so that the intersection of private knowledge can always
be upgraded to the shared domain. If $\psi \notin SD$, the definition of $SD$ has been
violated.



3.3 MSBR and DBR of Dragoni et al. 

MSBR. Recognizing that agents may join the network with low degrees of
competence or non-cooperative intentions, Dragoni et al. state that the
reliability of the source affects the credibility of the information and vice versa [5].
Neglecting the “priority to the incoming information principle” is thus proposed
and implemented by considering (informant, information) pairs rather than just
information.

In the AGM paradigm, priority is given to the incoming information. For
instance, the new information will be accepted using an expansion if it is
consistent with the current belief set. In transmutation [32], possibility
theory [8] or other revision schemes, the new information is allowed to come with
a certain rank and to be accepted at this prescribed level. In this case, although
you do not necessarily accept the new information in full, you do need to respect
the incoming rank. There is no explicit rational step to change the rank.
Giving no priority to incoming information (or, in the extreme case, neglecting it) is
thus a two-step procedure: first, revise the rank according to the reliability of
the informant, and then incorporate the information with the new rank.
Therefore, by evaluating the source reliability, the receiver agent has the flexibility of
deciding whether to take the impinging information into account or not.



DBR. In DTM [19], all the agents are both individually and mutually consistent
with any other agent with whom they have exchanged knowledge. In DBR [5][6], by
contrast, the “Liberal Belief Revision Policy” is adopted, that is, all the agents stand
by their own beliefs based on their own view of the evidence. Therefore, local
consistency is considered a prerequisite, but global consistency is only
considered an end point which is eventually reached through some selection
strategies. Every node (agent) in the DBR system is able to carry out MSBR as
well as communicate with the other nodes.

As to the knowledge structure, local knowledge is distinguished from
global knowledge. By using some voting functions[5], local knowledge can
be selected to become global knowledge. In Kfir-dahav and Tennenholtz’s
terminology, Dragoni et al.’s local and global knowledge could be seen
as the counterparts of the knowledge in PDi and SD, respectively. A subtle
improvement made by Dragoni et al. is that a voting function is employed
to generate global knowledge rather than simply taking the intersection
of the PDi. In fact, private or local knowledge as defined here is not private
in the real sense; in other words, it is implicitly assumed that some super
agent, at least a human developer, exists in the system to supervise the
knowledge upgrade®. Private knowledge should be confidential, ie invisible to
all outsiders. Classifying agent knowledge in these ways is still too simple
to some extent. For example, in an agent society housing multiple agents, some
agents might wish to form small groups or teams to accomplish goals. Therefore,
shared knowledge (or team knowledge) rather than global knowledge is sometimes
preferred.
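
The difference between taking the intersection of the private domains and
selecting global knowledge by voting can be sketched as follows. Simple
majority voting is used here only as an example of a voting function; the
actual functions in [5] may differ, and the function names and the `quota`
parameter are hypothetical.

```python
from collections import Counter


def intersection_knowledge(private_domains):
    """Shared knowledge as the plain intersection of all private domains.

    Assumes at least one domain is supplied.
    """
    return set.intersection(*(set(domain) for domain in private_domains))


def majority_vote_knowledge(local_kbs, quota=0.5):
    """Global knowledge as the sentences held by more than `quota` of the agents."""
    votes = Counter(sentence for kb in local_kbs for sentence in set(kb))
    return {s for s, count in votes.items() if count > quota * len(local_kbs)}


kbs = [{"p", "q"}, {"p", "r"}, {"p", "q", "s"}]
print(sorted(intersection_knowledge(kbs)))   # ['p']
print(sorted(majority_vote_knowledge(kbs)))  # ['p', 'q']
```

In this toy society the sentence q is held by two of the three agents, so it is
promoted by the vote although it is absent from the intersection.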



3.4 Summary 

To summarize, mutual belief revision is only capable of revising knowledge,
not graded belief. Kfir-dahav and Tennenholtz’s MABR and Dragoni et al.’s DBR
have overcome this by discussing revision in the broad sense of knowledge system
transmutation. The social behaviour that might affect information credibility
has been discussed by Dragoni et al. in the context of MSBR. This is a great
advance over other studies based on reliable, faithful and mutually trustworthy
communication. The knowledge structures proposed by Kfir-dahav and Tennenholtz
and by Dragoni et al. initiate the effort of classifying knowledge, but they
are not rich enough to eliminate some ambiguity or to offer some essential
flexibility. It will be shown in section 4 that finding a feasible way of
classifying knowledge is an important step towards modelling social
heterogeneity from the viewpoint of the KB. None of the schemes discussed above
addresses the issues that might arise from multiple revision strategies,
although this could be highly desirable in a heterogeneous revision system.
Although the communication mode of MABR and DBR is not restricted to broadcast,
as it is in mutual belief revision, the underlying heterogeneity issues which
might inhibit efficient and reliable communication have not been addressed in
any depth. Since the heterogeneity exists elsewhere in MASs, special issues
arise in MABR, which must be discussed.



® Knowledge upgrade is one phase of knowledge migration, which also encompasses
the dual phase of upgrade, ie knowledge degrade. Upgrade refers to the process by which
local knowledge is selected to become global knowledge, while degrade is the opposite
process. Knowledge migration will be fully discussed in Part II of this paper.

4 Heterogeneity of Multi-Agent Systems

Much of the conceptual power of the MAS paradigm arises from the flexibility
and sophistication of agent interactions and organisations. BR is the basic
skill by which both individual agents and agent societies maintain consistency,
and the BR process and its result depend heavily on the way agents communicate,
cooperate, coordinate and negotiate with one another. Heterogeneity is one of
the basic sources of such flexibility, but it also causes complexity. Taking
the opportunities as well as solving the difficulties could lead to a more
dynamic and versatile BR perspective. Therefore, it is necessary to study how
heterogeneities might affect BR behaviours in MASs.

Heterogeneities in a MAS can take many forms, ranging from the hardware
and software platform that each agent is based on, to the organisation schema
that relates individual agents socially to others to form teams and societies, to
the basic knowledge representation structure and reasoning strategy that make
the agent intelligent, to the problem domain that an agent specialises in. This
paper focuses on the issues raised by heterogeneities in the knowledge systems,
which include the last two cases. The following is a brief classification
according to the source (ie level) of these knowledge system heterogeneities.

- Social Character Heterogeneity (social-level): Within the MAS paradigm, an
agent is socially situated in a particular environment with other agents. As
a problem solving entity, an agent is also defined as software that acts on
behalf of the user to accomplish a task assigned by the user. Therefore, just
as humans can behave in a benevolent or malicious way, there is no reason
to forbid agents from possessing such characters. In the context of modelling
trust[21], “free will” has been defined to describe the mental process that
decides between benevolent and malicious behaviour. Agents possessing free
will of this type are designated as passionate. An agent such as an
algorithm, a protocol, or a piece of software or hardware, which can hardly be
characterized as having free will, is classified as rational. In addition, for
technical or other reasons (e.g. hardware quality), an agent can be
either competent or incompetent. Combining these characteristics, agents
can be roughly classified into four categories: competent rational, competent
passionate, incompetent rational and incompetent passionate.

- Semantic or Logical Heterogeneity (meta-level): Borrowing terms from
cooperative information systems (CIS)[20], semantic heterogeneity results when
different conceptualisations and different database schemas are used to
represent the same or overlapping data which is replicated in two or more
databases. A simple real world example is the variety of grading techniques
in an educational system, such as percentage or letter grades[18]. A
generalisation of such heterogeneity also occurs in agent KBs. For example,
knowledge could be represented using different logics, chosen to suit an
agent’s problem solving capability with respect to a certain problem domain.

- Syntactic Heterogeneity (content-level): This is a domain specific
heterogeneity which arises from the fact that in many cases the same letters or
words are used to represent different concepts or objects by different agents,
and vice versa. This happens during the process of building knowledge into
autonomous agents to enhance their intelligence. Natural language is commonly
transformed into a simple logical form, and developers have the freedom to name
things as they wish. For example, some agent may choose the letter “a” to
describe an apple, while other agents might prefer the whole word
“apple” or something else. This is also a common phenomenon in
emerging research areas: different terminologies are used to describe
the same concept, or vice versa. As a research area matures, general terms
and frameworks are proposed to serve as a specification.

Previous studies in MABR show that researchers have progressively
put more emphasis on the heterogeneity existing in agent knowledge bases. In the
early 1990s, Fagin et al.[10] semantically defined mutual belief and common
knowledge. Van der Meyden extended this modal logic approach to MABR in
1994. Malheiro[25] in the same year defined private and shared belief to model
the BR process in a DTM system. Similarly, Kfir-dahav and Tennenholtz claimed
in 1996 that their work[22] is more amenable to solving the heterogeneity problem
than Van der Meyden’s, by stating that agent knowledge could fall into private
and shared domains. Recently, other researchers[24][30], following Fagin et al.’s
approach, defined various concepts such as team knowledge and shared knowledge.
Dragoni et al. also distinguish global knowledge from local knowledge[5].

In fact, defining shared, common and private knowledge paved the way
for modelling social character heterogeneity; in other words, the diversity
of different kinds of knowledge is the reflection of social heterogeneity in agent
KBs. Private knowledge is needed by a passionate agent to keep confidential 
information. Such privacy enables the possibility of malicious behaviour which 
is sometimes needed for an agent to maximize its own or group utility in a 
competitive environment. On the other hand, common or shared knowledge is 
essential to establish cooperation-oriented communication and commitment. For 
example, incompetent agents could carry out teamwork by sharing knowledge so 
as to achieve high-level goals which could not be accomplished by any individual. 
Understanding the similarities and distinctions among these conceptualisations 
of agent knowledge poses a serious challenge when trying to integrate systems 
based on different knowledge structures. 

Semantic and syntactic heterogeneity have not yet been studied in the context 
of MABR. 

Various BR strategies are examples of semantic heterogeneities that exist in
MABR. For BR within the AGM paradigm, many revision schemes have emerged
during the past decade, such as numerical revision using probabilistic[28] or
possibilistic logic[8], sentence-based revision using various
transmutations[32], and so on. Because a variety of ranking mechanisms® might
be used when employing various revision strategies, special communication and
interoperability issues arise. How do agents communicate with each other in
terms of information credibility? In other words, how does an agent incorporate
new information from another ranking system?

® Ranging from ordinal natural numbers (ie {0, 1, 2, ..., ∞}) to the unit interval (ie [0, 1]).

Communication and interoperability difficulties are associated with syntactic
heterogeneity too. How can information reliability be guaranteed during
the information passing process in a system with such low-level heterogeneity
that the same letter means different things to different agents?

5 Ontology - A Solution to the Heterogeneity Issues 

Recently, it has become more and more widely accepted that ontologies are an
efficient approach to the problems involved in the generalization of low-level
heterogeneous data to relatively high-level concepts for the purposes of
communication, system interoperability and software reusability. To solve the
heterogeneity issues in MABR, different types of ontology are needed.

The basic idea of conceptualisation needs to be clarified before introducing
ontology. According to Guarino[16], a conceptualisation is a set of informal
rules that constrain the structure of a piece of reality. A conceptualisation may
be implicit, e.g. exist only in someone’s head, or be embodied in a piece of
software. An explicit account or representation of some part of a
conceptualisation is usually called an ontology[15].

Uschold’s review paper on knowledge level modelling[31] has an excellent
treatment of ontology related terms and concepts. The following are some of the
key dimensions along which ontologies may vary, adapted from this review.

- Formality: An ontology is highly informal if expressed loosely in natural
language; structured informal if in a restricted and structured form of natural
language; semi-formal if in an artificial, formally defined language such as
Ontolingua or the Knowledge Interchange Format (KIF) (links in [14]);
rigorously formal if in meticulously defined terms with formal semantics,
theorems and proofs of such properties as soundness and completeness.

- Purpose: There are three main categories of use of ontologies: for
communication between people and organisations, for interoperability
between systems, and for system engineering benefits.

- Subject matter: A domain ontology is for special subjects such as medicine
or finance. An upper model is general world knowledge. A task, method or
problem solving ontology is for the subject of problem solving. A representation
ontology is for the subject of a knowledge representation language.

To enable various degrees of sharing among agent knowledge bases, it is
necessary to establish a general knowledge structure which captures not only the
properties of shared knowledge/belief but also those of private knowledge/belief.
The main purpose of this general framework is to enable interoperability within a
heterogeneous society and communication across society boundaries. A
heterogeneous society is an agent society populated with both rational and
passionate agents who might be competent or incompetent in a certain domain. An
ontology for this purpose serves as an interchange format to translate between
different modelling methods and paradigms. Ontolingua or KIF would be suitable
for designing a computer executable semi-formal ontology. Therefore, considering
its nature as a general framework, the ontology needed here could be seen as
a semi-formal upper model for the classification of agent knowledge bases. The
upper model bridges the gap between rational and passionate societies. It
also establishes a foundation for modelling a heterogeneous society. An
implementation proposal for modelling heterogeneous agent behaviour could be
based on whether the agent will release its true opinion on the credibility of
the information it passes to others. Rational agents will never deliver
information which they do not believe, or deliver it with the wrong credibility.
Passionate agents will occasionally pass information with the wrong rank.
Modelling agent competency is more complicated, since many aspects are involved.
If agent qualification is evaluated by the adequacy of its knowledge with regard
to some special domain, the model should be able to reason about ignorance. If
an agent is evaluated by the reliability of its revision skill, competency could
be judged by whether there is inconsistency in its knowledge base. In this
sense, an inconsistent local knowledge base is allowed in order to capture the
behaviour of stupid agents (i.e. faulty reasoners), who believe in sentences
inconsistent with their knowledge bases. A detailed discussion of the upper
model will appear in Part II.
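
The behavioural distinction between rational and passionate agents can be made
concrete with a small sketch. Everything in it is hypothetical: the `Agent`
class, the `lie_probability` parameter and the choice of reporting `1 - rank`
as the distorted rank are illustrative assumptions, not part of the upper model
described in Part II.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    passionate: bool              # False means the agent is rational
    lie_probability: float = 0.0  # only consulted for passionate agents

    def report(self, sentence, true_rank, rng):
        """Return the (sentence, rank) pair the agent passes to others.

        A rational agent always releases its true rank. A passionate agent
        distorts the rank with probability `lie_probability`; the distortion
        used here (1 - rank) is an arbitrary illustrative choice.
        """
        if self.passionate and rng.random() < self.lie_probability:
            return sentence, 1.0 - true_rank
        return sentence, true_rank


rng = random.Random(0)
print(Agent("r1", passionate=False).report("p", 0.9, rng))  # ('p', 0.9)
```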

Similarly, according to the classification criteria above, for the semantic
heterogeneity of ranking systems a problem solving ontology is needed to
transform one ranking system into another, while for syntactic heterogeneity a
specific domain ontology needs to be developed. To be computer executable, a
semi-formal or rigorously formal format would normally be preferable or required.
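
As a rough illustration of how a domain ontology could resolve syntactic
heterogeneity, the sketch below maps each agent's private symbols to shared
concepts and translates a symbol from sender to receiver through the concept.
The dictionary layout, the agent names and the `translate` function are
assumptions made for this example only.

```python
# Hypothetical domain ontology: each agent's local symbols mapped to shared concepts.
DOMAIN_ONTOLOGY = {
    "agent_1": {"a": "Apple"},
    "agent_2": {"apple": "Apple", "a": "Ant"},
}


def translate(symbol, sender, receiver, ontology):
    """Translate `symbol` from the sender's vocabulary into the receiver's.

    Returns None when the sender's symbol is unknown or the receiver has no
    symbol for the corresponding concept.
    """
    concept = ontology[sender].get(symbol)
    if concept is None:
        return None
    for local_symbol, local_concept in ontology[receiver].items():
        if local_concept == concept:
            return local_symbol
    return None


print(translate("a", "agent_1", "agent_2", DOMAIN_ONTOLOGY))  # apple
```

Note that the letter “a” means an apple to the first agent and something else
to the second, so passing the raw symbol without translation would silently
change its meaning.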

6 Future Work 

Ontologies can be both constructive and destructive. Information distortion 
could occur during the generalisation process of ontology design. Many crite- 
ria exist to prescribe a good ontology[13] and more are still under investigation. 
Essentially, the key is to thoroughly understand the problem domain. 

To design the upper model of knowledge structure, both philosophical and
psychological investigations into the nature of shared/common and private
knowledge would be a good start. Currently, a general knowledge structure,
which encompasses private, shared and accessible knowledge, is proposed along
the way to our model (described in Part II of this paper). Many interesting
features of this structure are still under investigation, e.g. knowledge
migration, speech act generation and its exception tolerant nature. In the
knowledge migration phase, a proper revision operator on a shared domain poses a
great challenge because BR in this case is a mixed process composed
of traditional BR, communication and other interactions. Sophisticated decision
making techniques are required of such a BR operator, because the revision
process could branch according to communication mode and social intention.
Therefore, in certain circumstances, the revision result might not be easily
predicted. The implementation of this idea in a multi-agent version of SATEN is
the focus of our current research.

The possibility of translating among various ranking systems is another
challenge. It is important to note that not all ranking systems are equally
expressive; hence information will be lost when one translates ranks from a rich
ranking system such as [0,1] to an ordinal one. Determining the appropriate
translation may require substantial communication between agents to clarify the
intended meaning of their ranking. This determination is to be supported by the
ontology.
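
The loss of information can be seen in a small sketch that translates between
the unit interval and a finite ordinal scale. The linear bucketing rule, the
finite number of `levels` and the function names are assumptions chosen for
illustration; an actual translation would be agreed between the agents with
the support of the ontology.

```python
def unit_to_ordinal(rank, levels):
    """Map a rank in [0, 1] to one of the ordinal values 0, 1, ..., levels."""
    if not 0.0 <= rank <= 1.0:
        raise ValueError("rank must lie in the unit interval")
    return round(rank * levels)


def ordinal_to_unit(rank, levels):
    """Map an ordinal value 0, 1, ..., levels back into [0, 1]."""
    return rank / levels


# Two distinct ranks collapse onto the same ordinal value ...
print(unit_to_ordinal(0.61, 10), unit_to_ordinal(0.64, 10))  # 6 6
# ... so translating back cannot recover the original distinction.
print(ordinal_to_unit(6, 10))                                # 0.6
```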

Finally, the established system will be tested by applying it to special domains 
such as electronic commerce and group decision support systems.



Abstract. We describe an approach to the use of genetic programming
for object detection problems in which the locations of small objects of
multiple classes in large pictures must be found. The evolved programs
use a feature set computed from a square input field large enough to
contain each of the objects of interest and are applied, in moving window
fashion, over the large pictures in order to locate the objects of interest.
The fitness function is based on the detection rate and the false alarm
rate. We have tested the method on three object detection problems of
increasing difficulty with four different classes of interest. On pictures of
easy and medium difficulty all objects are detected with no false alarms.
On difficult pictures there are still significant numbers of errors; however,
the results are considerably better than those of a neural network based
program for the same problems.

Keywords: Machine learning, Genetic algorithms, Neural networks, Vision



1 Introduction 

As more and more images are captured in electronic form, the need for programs
which can find objects of interest in a database of images is increasing. For
example, it may be necessary to find all tumors in a database of x-ray images,
all cyclones in a database of satellite images or a particular face in a database
of photographs. The common characteristic of such problems can be phrased
as “Given subpicture_1, subpicture_2, ..., subpicture_n which are examples of
the object of interest, find all pictures which contain this object and its
location(s)”. Figure 4 shows examples of problems of this kind. Figure 4c shows
a human retina. We are required to find all of the micro aneurisms and
haemorrhages, as indicated by the white squares. Figure 5 shows an enlarged
view. Examples of other problems of this kind include target detection problems
[5,17,19] where the task is to find, say, all tanks, trucks or helicopters in a
picture. Unlike most of the current work in the object recognition area, where
the task is to detect only objects of one class [5,11,12], our objective is to
detect objects from a number of classes.

Several approaches have been applied to automatic object detection and
recognition problems. Typically, they use multiple independent stages, such
as preprocessing, edge detection, segmentation, feature extraction and object
classification [4,10], which often results in some efficiency and effectiveness
problems. The final results rely too much upon the results of each stage. If some
objects are lost in one of the early stages, it is very difficult or impossible
to recover them in a later stage. To avoid these disadvantages, this paper
introduces a single stage approach.

There have been a number of reports on the use of genetic programming in 
object detection and classification [13,15]. [18] describes a genetic programming 
system for object detection in which the evolved functions operate directly on the 
pixel values. [16] describes a genetic programming system and a face recognition 
application in which the evolved programs have a local indexed memory. All of 
these approaches are based on two class problems, that is, objects vs everything 
else. 

Performance in object detection is measured by detection rate and false alarm
rate. The detection rate is the number of objects correctly reported as a
percentage of the total number of real objects, and the false alarm rate is the
number of objects incorrectly reported as a percentage of the total number of
real objects. For example, a detection system looking for the 18 grey squares in
figure 4a may report that there are 25. If 9 of these are correct the detection
rate will be (9/18) × 100 = 50%. The false alarm rate will be
(16/18) × 100 = 88.9%. It is important to note that finding objects in pictures
with very cluttered backgrounds is an extremely difficult problem and that false
alarm rates of 200-2,000% (that is, the detection system suggests that there are
up to 20 times as many objects as there really are) are common [12,14].
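
The arithmetic of this example can be checked with a few lines of Python. The
function names are ours; the figures (25 reported, 9 correct, 18 real objects)
are those used in the example above.

```python
def detection_rate(correct, real_objects):
    """Objects correctly reported, as a percentage of the real objects."""
    return 100.0 * correct / real_objects


def false_alarm_rate(incorrect, real_objects):
    """Objects incorrectly reported, as a percentage of the real objects."""
    return 100.0 * incorrect / real_objects


reported, correct, real_objects = 25, 9, 18
print(detection_rate(correct, real_objects))                         # 50.0
print(round(false_alarm_rate(reported - correct, real_objects), 1))  # 88.9
```

Because both rates are taken relative to the number of real objects, the false
alarm rate is not bounded by 100%.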



1.1 Outline of the Approach to Object Detection 

A brief outline of the method is as follows: 

1. Assemble a database of pictures in which the locations and classes of all of
the objects of interest are manually determined. Reserve some of the pictures 
as ‘unknowns’ for measuring detection performance. 

2. Determine an appropriate size (n) of a square which will cover all objects of 
interest and form the input field. 

3. Invoke an evolutionary process to generate a program which can determine 
the class of an object in its input field. 

4. Apply the generated program as a moving window template[1] to the pictures
reserved in step 1 and obtain the locations of all the objects of
interest in each class. Calculate the detection rate and the false alarm rate
on the test set as the measure of performance.

1.2 Goals

Our overall goal is to determine whether programs evolved by genetic program- 
ming can do a good enough job of finding the objects of interest in pictures such 
as those shown in figure 4. Specifically we are interested in: 

- What image processing features would make useful terminals?

- Will the four standard arithmetic operators be sufficient for the
function set?

- How can the fitness function be constructed given that there are several
classes of interest?

- How will performance vary with increasing difficulty of image detection
problems?

- Will the performance be better than a neural network approach[22] on the
same problems?

2 Genetic Programming 

Genetic programming is a relatively recent technology based on the use of Dar- 
winian evolution in the generation of computer programs. Developed and first 
published by John Koza in [7], it has been successfully applied in areas such as
pattern recognition[13], control[9], robocup[20] and modelling[6]. The process
starts with a randomly generated population of programs. Each program is exe- 
cuted and its degree of success in achieving its task is measured and assigned as 
its fitness. Programs with high fitness are then selected for mating. In the mat- 
ing process two parents are chosen and randomly selected sub-trees are swapped 
giving two children of a new population. In general, individuals in the new gener- 
ation will be fitter than those in the current generation. The process terminates 
when the best individual does not improve over the course of a few generations. 
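
The mating step can be sketched as subtree crossover on programs stored as
nested lists. The representation (`[operator, left, right]` with strings as
terminals) and the helper names are assumptions for illustration, and every
node is treated as an equally likely crossover point, which need not match the
selection scheme of a particular genetic programming engine.

```python
import copy
import random


def node_paths(tree, path=()):
    """Index paths of all nodes; a program is a terminal or [op, left, right]."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(node_paths(child, path + (i,)))
    return paths


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return the tree with the node at `path` replaced (modifies `tree` in place)."""
    if not path:
        return new_subtree
    get_subtree(tree, path[:-1])[path[-1]] = new_subtree
    return tree


def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree between copies of the two parents."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    path_a = rng.choice(node_paths(a))
    path_b = rng.choice(node_paths(b))
    sub_a = copy.deepcopy(get_subtree(a, path_a))
    sub_b = copy.deepcopy(get_subtree(b, path_b))
    return replace_subtree(a, path_a, sub_b), replace_subtree(b, path_b, sub_a)


p1 = ["+", "f1", ["*", "f2", "f3"]]
p2 = ["-", ["/", "f4", "f5"], "f6"]
child1, child2 = crossover(p1, p2, random.Random(0))
```

The parents are copied before the swap, so they remain available for further
matings in the same generation.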

The programs are constructed from a terminal set and a function set which 
will vary according to the problem domain. Functions form the root and the 
internal nodes of the parse tree representation of a program. Terminals have 
no arguments and form the leaves of the parse tree. Terminals represent the 
inputs to the program. Assuming that one has available a generalized genetic 
programming ‘engine’ to perform the evolutionary processes, the task of using 
genetic programming for any given problem becomes one of determining the 
appropriate set of functions and terminals and a suitable fitness function. It is 
important to note that the selection of the functions and terminals is critical to 
success. A bad selection could result in very slow convergence or in no solution
being found at all. More details of genetic programming and its applications
can be found in [2,8].

2.1 Genetic Programming Adapted to Object Detection 

As noted above, the main tasks in using genetic programming in some problem
domain are (1) Determine the terminal set, (2) Determine the function set, and
(3) Determine the method of measuring fitness.

2.2 The Terminal Set

For object detection problems terminals correspond to image features. In our
case twenty features, F1 to F20 in figure 1, are extracted from a square input
field, which is large enough to contain all objects of interest. In figure 1a, the
grey circle is an object of interest and the filled square A1-B1-C1-D1 represents
the square input field. Other squares represent regions for which features will
be computed. The mean and standard deviation of the pixels comprising each
of these regions are used as two separate features. There are 6 regions giving 12
features, F1 to F12. We also use the pixels along the main axes of the input
field, as shown in figure 1b, giving features F13 to F20.
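
A minimal sketch of how features of this kind might be computed from an input
field held as a list of pixel rows is given below. Only F1, F2 and F13 to F16
are shown. The use of the population standard deviation, the choice of index
n // 2 for the central row and column, and the function names are assumptions.

```python
from statistics import mean, pstdev


def region_stats(field, top, left, height, width):
    """Mean and population standard deviation of a rectangular region."""
    pixels = [field[r][c]
              for r in range(top, top + height)
              for c in range(left, left + width)]
    return mean(pixels), pstdev(pixels)


def some_features(field):
    """F1, F2 for the whole n x n input field; F13 to F16 for its main axes."""
    n = len(field)
    mid = n // 2  # assumed position of the central row and column
    f1, f2 = region_stats(field, 0, 0, n, n)        # big square
    f13, f14 = region_stats(field, mid, 0, 1, n)    # central row
    f15, f16 = region_stats(field, 0, mid, n, 1)    # central column
    return {"F1": f1, "F2": f2, "F13": f13, "F14": f14, "F15": f15, "F16": f16}


field = [[0, 0, 0, 0],
         [0, 10, 10, 0],
         [0, 10, 10, 0],
         [0, 0, 0, 0]]
print(some_features(field)["F1"])  # 2.5
```

The remaining region features would follow the same pattern with different
`top`, `left`, `height` and `width` arguments.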



[Figure 1, panels (a) and (b): diagrams of the input field, its regions and its
axes, with the accompanying feature table below.]

Features (mean, sd)    Regions and axes of interest
F1, F2                 big square A1-B1-C1-D1
                       small central square A2-B2-C2-D2
                       lower right square O-H1-C1-F1
F13, F14               central row of the big square G1-H1
F15, F16               central column of the big square E1-F1
F17, F18               central row of the small square G2-H2
F19, F20               central column of the small square E2-F2

Fig. 1. Feature extraction based on local image region intensities. Note: sd -
standard deviation



In addition to these features we have a terminal which generates a random 
number in the range [0,255]. This corresponds to the number of grey levels in 
the images. 

These features have the following characteristics: 

— They are symmetrical and contain some information of object translation 
and rotation invariance. 

— Local region features are included. This assists the finding of object centres 
in the sweeping procedure - if the evolved program is considered as a mov- 
ing window template, the match between the template and the sub image 
forming the input field will be better when the moving template is close to 
the centre of an object. 

— They are domain independent and easy to extract. 
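
The sweeping procedure referred to above can be sketched as follows, with the
evolved program represented by any callable that takes an n x n window. Keying
each output by the window centre and skipping windows that would overlap the
picture border are assumptions of this sketch.

```python
def sweep(image, n, program):
    """Apply `program` to every n x n window of `image` (a list of pixel rows).

    Returns a dictionary from the (row, col) of each window centre to the
    program's output for that window. Windows that would extend past the
    picture border are skipped.
    """
    rows, cols = len(image), len(image[0])
    outputs = {}
    for top in range(rows - n + 1):
        for left in range(cols - n + 1):
            window = [row[left:left + n] for row in image[top:top + n]]
            outputs[(top + n // 2, left + n // 2)] = program(window)
    return outputs


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
total = lambda window: sum(sum(row) for row in window)  # stand-in for an evolved program
print(sweep(image, 2, total)[(1, 1)])  # 14
```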

2.3 The Function Set 

We use the function set F = {+, -, *, /}, which represents four arithmetic
operations that form the second order nodes (i.e. 2 arguments). The +, -, and *
operators have their usual meanings while / represents “protected” division,
which is the usual division operator except that a divide by zero gives a result
of zero. A generated program that performed particularly well is shown in
figure 2.



(+ (- (+ (+ (/ f16 f14) f5) (+ (/ (/ f11 (* f14 f20)) f11) (- f12
f14))) (- (* (- (* (* (* f9 f11) f1) f10) (* f9 f17)) (/ f5 f18))
(- (+ (+ f17 (* (+ f11 f12) f20)) (* (- (+ f2 145.765) (/ f6 f11))
(- 133.082 f17))) (/ f11 (* f14 f20))))) (* (- (* (- (- f6 f5) (*
f3 f6)) (/ (+ (+ f1 145.765) (* f16 f10)) f18)) f12) (+ (+ f17 (*
(+ f17 f12) f20)) (* (+ f14 f12) (- (+ f1 f12) f17)))))



Fig. 2. A generated program for the coins problem 
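
Protected division and the evaluation of a prefix-form program of this kind can
be sketched as follows. The parser, the tuple representation and the
convention that any token not found among the features is a numeric constant
are assumptions of this sketch, not a description of the system used in the
experiments.

```python
def protected_div(a, b):
    """Usual division, except that division by zero yields zero."""
    return 0.0 if b == 0 else a / b


OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}


def parse(text):
    """Parse a prefix expression such as '(+ (/ f16 f14) f5)' into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        token = tokens[pos]
        if token != "(":
            return token, pos + 1
        operator = tokens[pos + 1]
        left, pos = read(pos + 2)
        right, pos = read(pos)
        if tokens[pos] != ")":
            raise ValueError("every operator takes exactly two arguments")
        return (operator, left, right), pos + 1

    tree, _ = read(0)
    return tree


def evaluate(tree, features):
    """Evaluate a parsed program on a dictionary of feature values."""
    if isinstance(tree, tuple):
        operator, left, right = tree
        return OPERATORS[operator](evaluate(left, features), evaluate(right, features))
    return features[tree] if tree in features else float(tree)


features = {"f16": 8.0, "f14": 0.0, "f5": 3.0}
print(evaluate(parse("(+ (/ f16 f14) f5)"), features))  # 3.0
```

In the usage line the division by f14 = 0 yields 0 rather than raising an
error, so the program output is simply the value of f5.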



The output of any program is a floating point number which must indicate 
which of the objects of interest is currently in its input field. This is achieved 
as shown in figure 3 where n is the number of classes of interest, ProgOut is the 
output value of the evolved program and T is a constant. 



begin
   if (ProgOut ∈ (-∞, 0)) then
      the object is classified as background
   else if (ProgOut ∈ [0, T))
      the object is classified as class 1
   ...
   else if (ProgOut ∈ [(i-1)×T, i×T))
      the object is classified as class i
   ...
   else if (ProgOut ∈ [(n-1)×T, +∞))
      the object is classified as class n
   endif



Fig. 3. Mapping of program output to an object classification
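
The mapping of figure 3 can be written compactly in Python. Returning 0 for
the background class is a convention chosen for this sketch.

```python
def classify(prog_out, n_classes, T):
    """Map a program output to a class label following figure 3.

    Returns 0 for background and i (1 <= i <= n_classes) for class i.
    """
    if prog_out < 0:
        return 0                      # (-inf, 0): background
    if prog_out >= (n_classes - 1) * T:
        return n_classes              # [(n-1)*T, +inf): class n
    return int(prog_out // T) + 1     # [(i-1)*T, i*T): class i


print([classify(v, 3, 100) for v in (-5, 0, 99.9, 100, 250, 1e9)])
# [0, 1, 1, 2, 3, 3]
```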



2.4 The Fitness Function 

The fitness of a program in the population is calculated by using its detection
rate and false alarm rate on the training images and is obtained as follows:

1. Apply the program as a moving n × n (n is the size of the input field) window
template to each of the full training images and obtain the output values
of the program at each pixel position. Label each pixel position with the
‘detected’ object according to the algorithm described in figure 3. Call this
data structure a detection map.

2. Find the centres of objects of interest only. This is done as follows: 

- Scan the detection map for an object of interest. When one is found 
mark this point as the centre of the object and continue the scan n/2 
pixels later. 

3. Match these ‘detected’ objects with the known locations of each of the ‘true’ 
objects and their classes. A match is considered to occur if the detected ob- 
ject is within TOLERANCE pixels of its known true location. The detection 
rate Dr and the false alarm rate Fr are then computed. 

4. Compute the fitness as shown in equation 1. 

fitness(Fr, Dr) = A × Fr + B × (1 − Dr)    (1)

where A and B are constants which reflect the relative importance of false 
alarm rate versus detection rate. 

With this design, it is clear that the smaller the fitness, the better the perfor- 
mance. Zero fitness is the ideal case, which corresponds to the situation in which 
all of the objects of interest in each class are correctly found by the evolved 
program without any false alarms. 
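
Steps 3 and 4 can be sketched as follows. Several details are assumptions of
this sketch rather than statements about the original system: detections and
true objects are given as (class, row, col) triples, “within TOLERANCE pixels”
is interpreted per coordinate, each true object can be matched at most once,
and Dr and Fr are expressed as fractions of the number of true objects.

```python
def match_detections(detected, truth, tolerance):
    """Count correct and false detections; items are (class, row, col) triples."""
    unmatched = list(truth)
    correct = 0
    for cls, row, col in detected:
        hit = next((t for t in unmatched
                    if t[0] == cls
                    and abs(t[1] - row) <= tolerance
                    and abs(t[2] - col) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)   # a true object is matched at most once
            correct += 1
    return correct, len(detected) - correct


def fitness(detected, truth, tolerance, A, B):
    """Equation 1: A * Fr + B * (1 - Dr); smaller is better, zero is ideal."""
    correct, false_alarms = match_detections(detected, truth, tolerance)
    dr = correct / len(truth)
    fr = false_alarms / len(truth)
    return A * fr + B * (1 - dr)


truth = [(1, 10, 10), (2, 30, 30)]
detected = [(1, 11, 9), (2, 50, 50)]
print(fitness(detected, truth, tolerance=2, A=50, B=1000))  # 525.0
```

In the usage example one of the two objects is found and one report is a false
alarm, so Dr = Fr = 0.5 and the fitness is 50 × 0.5 + 1000 × 0.5 = 525.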



2.5 Genetic Programming Parameters

The values for the various system parameters used in the experiments are shown
in table 1. POPULATION_SIZE is the number of individuals in the population,
ELITISM_PCNT gives the percentage of the best individuals in the current
population that are copied unchanged to the next generation, CROSS_RATE
is the percentage of individuals in the next generation that are to be produced
by crossover, MUTATION_RATE is the percentage of individuals in the
next generation that are to be produced by mutation (thus ELITISM_PCNT
+ CROSS_RATE + MUTATION_RATE = 100%), CROSS_CHANCE_TERM
is the probability that, in a crossover operation, two terminals will be swapped,
CROSS_CHANCE_FUNC is the probability that, in a crossover operation, random
subtrees will be swapped (thus CROSS_CHANCE_TERM + CROSS_CHANCE_FUNC
= 100%), INITIAL_MAX_DEPTH is the maximum depth of the randomly
generated programs in the initial population, MAX_DEPTH is the maximum
depth permitted for programs resulting from crossover and mutation operations,
MAX_GENERATIONS gives the stopping condition, and T, A, B and TOLERANCE
are as described above.

Table 1. Parameters used for GP training in the three databases

Parameters            Easy Pictures   Coin Pictures   Retina Pictures
POPULATION_SIZE       100             500             500
ELITISM_PCNT          10%             1%              2%
CROSS_RATE            65%             74%             73%
MUTATION_RATE         25%             25%             25%
CROSS_CHANCE_TERM     15%             15%             15%
CROSS_CHANCE_FUNC     85%             85%             85%
INITIAL_MAX_DEPTH     5               5               5
MAX_DEPTH             8               12              20
MAX_GENERATIONS       100             200             250
T                     100             100             100
A                     50              50              50
B                     1000            1000            3000
TOLERANCE (pixels)    2               2               2



3 The Image Databases 

We used three different databases in the experiments. Example pictures and 
key characteristics are given in figure 4. The pictures were selected to provide 
problems of increasing difficulty. Database 1 (Easy) was generated to give well 
defined objects against a uniform background. The pixels of the objects were 
generated using a Gaussian generator with different means and variances for 
each class. The coin pictures (database 2) were intended to be somewhat harder 
and were taken with a CCD camera over a number of days with relatively sim- 
ilar illumination. In these pictures the background varies slightly in different 
areas of the image and between images and the objects to be detected are more 
complex, but still regular. The retina pictures (database 3) were taken by a 
professional photographer with special apparatus at a clinic and contain very 
irregular objects on a very cluttered background. The objective is to find two 
classes of retinal pathologies - haemorrhages and microaneurysms. Note that in
each of the databases the background counts as a class. The objective is to find 
the locations (centres) of all the objects of interest in each class. 

4 Results 

This section presents a series of experiments on detection problems of
increasing difficulty with different classes of interest. The results are compared
with those obtained using a neural network approach for object detection on the 
same databases [21,22]. The method used was the same as that shown in section 
1.1, except that step 3 was replaced by a neural network. 
Fig. 4. Object Detection Problems of Increasing Difficulty
(a) Easy (Circles and Squares): Number of Images: 7; No. of interest Classes: 3; Picture Size 700x700
(b) Medium Difficulty (Coins): Number of Images: 20; No. of interest Classes: 4; Picture Size 640x480
(c) Very Difficult (Retinas): Number of Images: 15; No. of interest Classes: 2; Picture Size 1024x1024



4.1 Easy Pictures 

Table 2 shows a comparison of the results between the two methods. For Class1
(black circles) and Class3 (grey circles) both methods achieved a 100% detection
rate with 0% false alarms. For Class2 (grey squares) the genetic programming
system also achieved a 100% detection rate with 0% false alarms, whereas the
neural network system had a false alarm rate of 92% at a detection rate of
100%.



Table 2. Comparison of object detection results on 3-class easy pictures. Input
field size: 14x14

Easy Pictures                                 Object Classes
                                              Class1   Class2   Class3
Best Detection Rate (%)                       100      100      100
False Alarm Rate (%)    Neural network        0        92       0
                        Genetic programming   0        0        0



4.2 Coin Pictures 

Experiments with coin pictures gave similar results, which are shown in Table 3.
Detecting the heads and tails of 5 cent coins is relatively straightforward: the
neural network approach achieved a 100% detection rate without any false alarms.
Detecting heads and tails of 20 cent coins is a more difficult problem, for which
the neural network method resulted in many false alarms. The genetic programming
method gave the ideal results, that is, all the objects of interest were found
without any false positives for all four object classes.



Table 3. Comparison of object detection results on 4-class coin pictures. Input
field size: 24x24

Coin Pictures                                 Object Classes
                                              head005   tail005   head020   tail020
Best Detection Rate (%)                       100       100       100       100
False Alarm Rate (%)    Neural network        0         0         182       37.5
                        Genetic programming   0         0         0         0



4.3 Retina Pictures 

The results for the retina pictures are summarised in Table 4. Compared with the
results for the previous image databases the results are disappointing. However,
the false alarm rate is greatly reduced relative to the neural network method.

The results over the three databases show similar trends: for the same detection
rate, the genetic programming method never gives a higher false alarm rate than
the neural network method, and gives a lower one wherever the neural network
produces false alarms.



Table 4. Comparison of object detection results on retina pictures. Input field
size: 16x16

Retina Pictures                               Object Classes
                                              haem     micro
Best Detection Rate (%)                       70       100
False Alarm Rate (%)    Neural network        2698     10104
                        Genetic programming   1357     588



4.4 Summary and Analysis 

Summary We have tested the new genetic programming based approach on
multi-class object detection problems of increasing difficulty: a three-class easy
detection problem, a four-class medium difficulty coin detection problem and
a very difficult two-class retinal pathology detection problem. In all cases on
the easy and medium difficulty detection problems the new approach produced the
ideal results, that is, all the objects of interest in every class were found without
any false alarms. For “micro” and “haem” detection in the very difficult retina
pictures, the new method resulted in much better performance than the neural
network approach.
Analysis of Results of Retina Pictures We found three main reasons why
the results on retina pictures are not as good as those on easy and coin pictures.
Firstly, in easy and coin pictures the background is relatively uniform, whereas
in the retina pictures it is highly cluttered. Secondly, in the retina pictures,
there are only two classes of interest, that is, “micro” and “haem”, but there
are also several other classes such as veins and other eye anatomy. Thus the
objects of non-interest are classified as background. It appears that this makes the
background too complex to be considered as a single class. Thirdly, in the easy
and coin pictures all of the objects in a class are the same size whereas the sizes
of the objects in each class in the retina pictures are quite different. The sizes of
“micro” vary from 3 x 3 to 5 x 5 pixels and the sizes of “haem” vary from 7 x 7 to
16 x 16 pixels. Thus the sizes of “micro” are still similar, but this is not the case
for the “haem”. This might be the main reason that the results for detecting
“micro” are much better than those for “haem”. This suggests that the current set
of functions and terminals cannot be successfully applied to detecting objects in
the same class with a large variation in size.



Further Experiments We experimented with an alternative set of terminals
based on a circular set of features. The features were computed based on a
series of concentric circles centred in the input field. This terminal set focused on
boundaries rather than regions. In the coin and retina pictures, the results based
on this terminal set are slightly worse than those with the square region feature
set. This suggests that the local region features are better for these detection
problems.

We also experimented with a different function set. We hypothesised that
convergence might be quicker if the function values were close to the range (-1, 1).
We used dabs, sin and exp (absolute value, sine and exponent to base e) in
addition to the four arithmetic operators (+, -, * and /). The detection results are
similar to those based on the function set of only the four arithmetic operators.
Convergence was slightly faster when training on the coin and retina pictures, but
slightly slower for easy pictures. This suggests that dabs, sin and exp may be useful
for more difficult problems; however, considerably more experimentation is required.



5 Conclusions 

The goal of this paper was to develop a general, single stage method for detecting 
small objects of multiple classes in large pictures based on genetic programming. 
A secondary goal was to compare the performance of this method with a neural 
network based method. The results show that genetic programming can be used 
to generate programs for object detection. The method appears to be applicable 
to detection problems of varying difficulty. The genetic programming method 
also gives better detection performance than a neural network based approach 
for the same problem set. 
Fig. 5. Enlargement of retina picture



The genetic programming approach has the following limitations:

- The training times for the coin problem and the retina problem are quite 
long. Some of the runs took longer than 48 hours on a 4 processor ULTRA- 
SPARC4. We are investigating ways of shortening the training times. 

- The method is not particularly effective in detecting objects with different 
sizes which are in the same class, for example the haemorrhages in the retina 
pictures. This might be related to the simple pixel-based image features used. 
High level image features which contain size invariant information need to 
be investigated. 

- The classification strategy employed in the generated programs is not easy 
to determine. However, if a particular feature does not appear in a program 
that works well, it can be inferred that that feature is not important. 

- Some experimentation is required to find good values of the various param- 
eters for each different problem. 

- A threshold T (figure 3) was used to give “fixed” size ranges for determining 
the class of the object from the output of the program. We are investigating 
ways of finding individual thresholds for each class. 
Abstract. We describe a stochastic algorithm learning Boolean functions
from positive and negative examples. The Boolean functions are
represented by disjunctive normal form formulas. Given a target DNF $F$
depending on $n$ variables and a set of uniformly distributed positive
and negative examples, our algorithm computes a hypothesis $H$ that
rejects a given fraction of negative examples and has an $\varepsilon$-bounded error
on positive examples. The stochastic algorithm utilises logarithmic cooling
schedules for inhomogeneous Markov chains. The paper focuses on
experimental results and comparisons with a previous approach where
all negative examples have to be rejected [4]. The computational
experiments provide evidence that a relatively high percentage of correct
classifications on additionally presented examples can be achieved, even
when misclassifications are allowed on negative examples. The detailed
convergence analysis will be presented in a forthcoming paper [3].



1 Introduction 

The learnability of disjunctive normal forms (DNF) has been studied intensely 
over the past decade in the context of Valiant’s PAC learning model [22]. For
the general case of DNF, the learnability from positive and negative examples 
is still an open problem. Therefore, various subclasses have been studied within 
different modifications of the PAC learning model. Positive results on PAC learn- 
ability could be obtained for relatively restricted subclasses of DNF only, such 
as DNF formulae consisting of terms with a constant number of literals or a 
constant number of terms, respectively; see [12,22,23]. Another example of PAC 
learnable problem classes are constant-depth decision lists [20], which include the 
above-mentioned subclasses of DNF. In [21], the learnability of visual concepts 
has been investigated, which is closely related to the learning problem of DNF. 
The approach has been extended in [15] to a wider class of digital pictures and 
the authors obtained that DNF are PAC learnable as long as the set of terms 
FKV 0352401N7. Part of the research was done while the first author was visiting
satisfying a given example can be characterised in polynomial time. Thus, the
number of terms as well as the number of literals are not bounded by a constant.
In [2], it has been shown that learning all but an exponentially small fraction of
DNF formulas is possible when the average term size is at most a constant, or at
least a constant fraction $\Omega(n)$ of the variables. Another line of research deals
with various extensions of Valiant’s learning model, e.g., by equivalence and
membership queries; see [5,17,19]. For an overview on DNF learnability and the
discussion of different learning models, including a comprehensive list of
references, we refer to [2].

In the PAC learning model, the probability distributions $D(\sigma)$ of positive
and negative examples $\sigma \in \{0,1\}^n$ are not specified and the hypothesis has to
be found in polynomial time, i.e., the learning algorithm has to provide with
probability at least $1 - \delta$ a hypothesis with an error of at most $\varepsilon$, regardless
of $D(\sigma)$. In some papers, this condition has been weakened, i.e., subexponential
algorithms have been designed for special classes of DNF under the uniform
probability distribution $D(\sigma) = 2^{-n}$. In [24], DNF consisting of [...] terms
are $\varepsilon$-approximated with confidence $1 - \delta$ from positive and negative examples
in time [...] under the uniform probability distribution. The learning
procedure employs a greedy algorithm that originally was designed for
approximate solutions of NP-complete set-theoretic problems [13]. The time bound has
been improved in [16] to [...], however, by using the membership
query model. Within the same model, Jackson [11] proved the existence of a
polynomial time algorithm. For a uniform probability distribution $D(\sigma) = 2^{-n}$,
the sum of probabilities of tuples accepted by a polynomial number of terms of
size $\Omega(n)$ is exponentially small, i.e., the algorithm described in [2] is not
applicable and the learnability problem remains open for polynomial DNF which
accept a constant fraction of all $2^n$ input tuples.

In the present paper, we consider essentially the same problem as in [16,24],
and our algorithm is very similar to Verbeurgt’s approach [24], but the complete
search for conjunctive terms of length $O(\log n)$ maximising the number of
satisfied positive examples is replaced by a simulated annealing search for a
conjunction that satisfies at least a fixed positive example and rejects a pre-defined
fraction $f \le 1$ of all negative examples. The search is performed for any positive
example separately, and after all positive examples have been processed, the
conjunctive terms with the largest number of satisfied positive examples which
cover the entire set of positive examples are taken together. A good classification
rate could be obtained if both the ratio of training examples per conjunction and
the number of repeated trials to enhance the coverage of positive training examples
were sufficiently large. In particular, the correct classification of previously
unknown negative examples depends significantly on the “multiplicity” of
conjunctions of the hypothesis on the set of positive training examples.

The present paper is a continuation of [4], where the special case $f = 1$ has
been considered. A detailed convergence analysis of the underlying simulated
annealing-based procedure is performed in [3].
2 Basic Definitions

2.1 The Approximation Problem 

We consider Boolean functions $F$ of $n$ variables which can be represented by
disjunctive normal forms containing at most $s = n^{O(1)}$ conjunctive terms. The
corresponding class of Boolean functions is denoted by

$\mathcal{F}_s := \{ F : F = \bigvee_{i \in I} t_i,\ |I| \le s \}$,

where the $t_i$ are conjunctions of literals $x^{\sigma} \in \{x_1, \dots, x_n, \bar{x}_1, \dots, \bar{x}_n\}$.
For the number $l(t_i)$ of literals we have $1 \le l(t_i) \le n$. We denote by
$NEG, POS \subseteq \{0,1\}^n$ negative and positive examples of $F$, i.e.,
$\sigma \in NEG \Rightarrow F(\sigma) = 0$ and $\sigma \in POS \Rightarrow F(\sigma) = 1$.

We assume a uniform distribution $D(\sigma) := 2^{-n}$ on the entire set of inputs
$\sigma \in \{0,1\}^n$. The uniform distribution $D(\sigma)$ makes it possible to consider only
relatively small values of $l(t)$: If $l(t) \ge \log k$, where $1 < k < n$, then $t(\sigma) = 1$
on at most $2^{n - \log k} = k^{-1} \cdot 2^n$ tuples $\sigma$. Thus, if we try to approximate
$F \in \mathcal{F}_s$ and all terms $t'$ with $l(t') > \log(s/\varepsilon)$ are deleted from $F$, then the
error is smaller than $s \cdot \varepsilon/s = \varepsilon$. We note that the error caused by deleting terms of a
certain length may occur only for “positive examples” of the original disjunctive
normal form. Thus, it is sufficient to take into account only terms $t$ of length
$\le \lceil \log(s/\varepsilon) \rceil$. Furthermore, it is assumed that all terms of the target DNF are
extended to the length $l_0 := \lceil \log(s/\varepsilon) \rceil$. The corresponding increase of the
number of terms is upper bounded by $O(s^2/\varepsilon) = n^{O(1)}/\varepsilon$, i.e., instead of $s$ one can
consider $s' = O(s^2/\varepsilon)$. The corresponding set of DNFs is

$\{ F' : F' = \bigvee_{i \in I} t_i,\ |I| = O(s^2/\varepsilon),\ l(t_i) = \lceil \log(s/\varepsilon) \rceil \}$.    (1)

We suppose that information about an element of this set is provided by $m$ examples,
$|NEG| + |POS| = m$.
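The counting argument behind the restriction to short terms rests on the fact that a conjunction of $l$ literals over $n$ variables is satisfied by exactly $2^{n-l}$ of the $2^n$ input tuples. A brute-force Python check for a small $n$ (a sketch only; the representation of a term as a dictionary from variable index to required bit is an assumption made for illustration):

```python
from itertools import product


def term_accepts(term, a):
    """term maps a variable index to the bit that the literal requires."""
    return all(a[i] == bit for i, bit in term.items())


def count_accepted(term, n):
    """Number of tuples in {0,1}^n satisfied by the conjunction."""
    return sum(term_accepts(term, a) for a in product((0, 1), repeat=n))


n = 6
term = {0: 1, 3: 0, 5: 1}            # x_0 AND (NOT x_3) AND x_5
accepted = count_accepted(term, n)
print(accepted, 2 ** (n - len(term)))
```

Under the uniform distribution the term is therefore satisfied with probability $2^{-l}$, which is why deleting at most $s$ terms longer than $\log(s/\varepsilon)$ changes the function on less than an $\varepsilon$ fraction of the inputs.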



2.2 General Structure of Simulated Annealing 

Our stochastic approximation algorithm is based on the general framework of 
simulated annealing which was introduced in [8,14] as a new approach to cal- 
culate approximate solutions of combinatorial optimisation problems, where the 
underlying framework was based on Metropolis’ method [18] of computing 
equilibrium states for substances consisting of interacting molecules. Detailed 
information about this method and applications in different areas can be found 
in [1,7,10]. We will consider simulated annealing procedures for the special type 
of logarithmic cooling schedules for inhomogeneous Markov chains. 

Simulated annealing algorithms are acting within a configuration space in
accordance with a certain neighbourhood structure, where the particular
transitions between adjacent elements of the configuration space are controlled by the
value of an objective function. Now, we are going to introduce these notions for
the basic step of approximations of DNF $F \in \mathcal{F}_s$ which is related to the search
for terms of length $\lceil \log(s/\varepsilon) \rceil$ rejecting all negative examples.
To simplify notations, we consider the configuration space for each positive
example $\sigma \in POS$ separately, i.e., we are concentrating on a single term as part
of the entire hypothesis $H$ for $F \in \mathcal{F}_s$. The underlying configuration space is
defined by

$\mathcal{C}_\sigma := \{ t : t(\sigma) = 1;\ l_0 \le l(t);\ |\{\eta : \eta \in NEG \ \&\ t(\eta) = 0\}| \ge f \cdot |NEG| \}$,    (2)

where $f \le 1$ is a global parameter.

For the definition of the neighbourhood relation, we consider $t, t' \in \mathcal{C}_\sigma$. The
two terms $t, t' \ne 0$, $t \ne t'$, are called neighbouring, if $t'$ can be generated from $t$
by deleting a literal in $t$ or adding $x_i^{\sigma_i}$ to $t$, where $\sigma_i$ is the $i$-th position of $\sigma$. The
set of neighbours of $t$ is denoted by $\mathcal{N}_t$, including $t$ itself. Thus, we have

$l_0 \le |\mathcal{N}_t| \le n + 1$.    (3)

The lower bound follows from the fact that at any step the $l_0$ critical variables
can be added if they have been deleted at previous steps.

The value $l_0 = \lceil \log(s/\varepsilon) \rceil$ is the length of terms for hypotheses $H$ related to
$F \in \mathcal{F}_s$. Therefore, the objective function can be defined by

$Z(t) := l(t)/l_0$.    (4)

The set of terms $t \in \mathcal{C}_\sigma$ minimising $Z$ is denoted by $\mathcal{C}_\sigma^{l_0}$.

We denote by $G[t, t']$ the probability of generating $t' \in \mathcal{N}_t$ from $t$ and by
$A[t, t']$ the probability of accepting $t'$ once it has been generated from $t$. Since
we consider a single step of transitions, the value of $G[t, t']$ depends on the set
$\mathcal{N}_t$. We choose the uniform probability

$G[t, t'] := \begin{cases} 1/|\mathcal{N}_t|, & \text{if } t' \in \mathcal{N}_t, \\ 0, & \text{otherwise.} \end{cases}$    (5)

As for $G[t, t']$, there are different possibilities for the choice of acceptance
probabilities $A[t, t']$. A straightforward definition related to the underlying
analogy to thermodynamic systems is the following:

$A[t, t'] := \begin{cases} 1, & \text{if } Z(t') - Z(t) \le 0, \\ e^{-(Z(t') - Z(t))/c}, & \text{otherwise,} \end{cases}$    (6)

where $c$ is a control parameter having the interpretation of a temperature in
annealing procedures.
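The acceptance rule (6) is the usual Metropolis criterion applied to the objective $Z(t) = l(t)/l_0$. A minimal Python sketch follows; the function name is hypothetical and the numbers are purely illustrative:

```python
import math


def acceptance_probability(z_old, z_new, c):
    """Acceptance probability A[t, t'] of (6) for temperature c > 0."""
    if z_new - z_old <= 0:
        return 1.0                      # never reject a non-worsening move
    return math.exp(-(z_new - z_old) / c)


l0 = 9
shorter = acceptance_probability(12 / l0, 11 / l0, c=0.5)   # delete a literal
longer = acceptance_probability(12 / l0, 13 / l0, c=0.5)    # add a literal
print(shorter, longer)
```

Deleting a literal lowers $Z$ by $1/l_0$ and is always accepted, while adding one raises $Z$ by $1/l_0$ and is accepted with probability $e^{-1/(l_0 c)}$, which shrinks as the temperature $c$ is lowered.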

Finally, the probability of performing the transition between $t$ and $t'$ is
defined by

$\Pr\{t \to t'\} := \begin{cases} G[t, t'] \cdot A[t, t'], & \text{if } t' \ne t, \\ 1 - \sum_{t'' \ne t} G[t, t''] \cdot A[t, t''], & \text{otherwise.} \end{cases}$    (7)

By definition, the probability $\Pr\{t \to t'\}$ depends on the control parameter $c$.

Let $a_t(k)$ denote the probability of being in the configuration $t$ after $k$ steps
performed for the same value of $c$. The probability $a_t(k)$ can be calculated in
accordance with

$a_t(k) := \sum_{t'} a_{t'}(k - 1) \cdot \Pr\{t' \to t\}$.    (8)

The recursive application of (8) defines a Markov chain of probabilities $a_t(k)$,
where $t \in \mathcal{C}_\sigma$ and $k = 1, 2, \dots$. When the parameter $c(k)$ is a constant, $c(k) = c$,
the chain is said to be a homogeneous Markov chain; otherwise, if $c(k)$ is
lowered at any step, the sequence of probability vectors $a(k)$ is an inhomogeneous
Markov chain.

We briefly recall Hajek’s result [10] on logarithmic cooling schedules for
inhomogeneous Markov chains, which can be easily verified for our specific
minimisation problem. Moreover, the parameters which are significant for the cooling
schedules can be chosen in a straightforward way for the elements of $\mathcal{C}_\sigma$.

First, we need to introduce some parameters characterising local minima of 
the objective function: 

Definition 1 A configuration $t' \in \mathcal{C}_\sigma$ is said to be reachable at height $h$ from
$t \in \mathcal{C}_\sigma$, if $\exists\, t_0, t_1, \dots, t_r \in \mathcal{C}_\sigma$ $(t_0 = t \wedge t_r = t')$ such that $G[t_u, t_{u+1}] > 0$,
$u = 0, 1, \dots, (r - 1)$, and $Z(t_u) \le h$ for all $u = 0, 1, \dots, r$.

We use the notation $height(t \Rightarrow t') \le h$ for this property. The configuration $t$ is
a local minimum, if $t \in \mathcal{C}_\sigma \setminus \mathcal{C}_\sigma^{l_0}$ and $Z(t') > Z(t)$ for all $t' \in \mathcal{N}_t \setminus \{t\}$.

Definition 2 Let $t_{min}$ denote a local minimum, then the depth $depth(t_{min})$ is
the smallest $h$ such that there exists a $t' \in \mathcal{C}_\sigma$ satisfying $Z(t') < Z(t_{min})$ that is
reachable at height $Z(t_{min}) + h$.

We will use the following result obtained by B. Hajek: 



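Theorem 1 [10] Given a cooling schedule defined by

$c(k) = \dfrac{\Gamma}{\ln(k + 2)}, \quad k = 0, 1, \dots,$    (9)

the asymptotic convergence $\sum_{t \in \mathcal{C}_\sigma^{l_0}} a_t(k) \to 1$ (as $k \to \infty$) of the simulated
annealing algorithm, using (5), (6), and (7), is guaranteed if and only if

(i) $\forall\, t, t' \in \mathcal{C}_\sigma\ \exists\, t_0, t_1, \dots, t_r \in \mathcal{C}_\sigma$ $(t_0 = t \wedge t_r = t')$:
$G[t_u, t_{u+1}] > 0$, $u = 0, 1, \dots, (r - 1)$;

(ii) $\forall\, h$: $height(t \Rightarrow t') \le h \iff height(t' \Rightarrow t) \le h$;

(iii) $\Gamma \ge \max_{t_{min}} depth(t_{min})$.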
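To get a feeling for how slowly the schedule (9) cools, the following Python sketch evaluates $c(k)$ and the resulting probability of accepting a move that adds one literal, i.e., that raises $Z$ by $1/l_0$. The function names are hypothetical, and $\Gamma = 1$ and $l_0 = 9$ are chosen only for illustration.

```python
import math


def cooling(k, gamma=1.0):
    """Logarithmic cooling schedule c(k) = Gamma / ln(k + 2)."""
    return gamma / math.log(k + 2)


def uphill_acceptance(k, l0, gamma=1.0):
    """Probability of accepting an increase of Z by 1/l0 at step k, using (6)."""
    return math.exp(-(1.0 / l0) / cooling(k, gamma))


for k in (0, 98, 9998):
    print(k, round(cooling(k), 4), round(uphill_acceptance(k, l0=9), 4))
```

Substituting (9) into (6) gives the closed form $(k+2)^{-1/(l_0 \Gamma)}$ for this probability, so uphill moves become rare only polynomially fast in $k$; this slow decay is what the convergence guarantee relies on.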
3 The DNF Approximation Algorithm

3.1 The Search Algorithm for Conjunctive Terms 

Given a positive example $\sigma$ of a function $F \in \mathcal{F}_s$ and a polynomial set NEG
of negative examples, we employ a simulated annealing procedure to find a
conjunctive term of length $l_0$ that satisfies $t(\sigma) = 1$ as well as $t(\eta) = 0$ for
$f \cdot |NEG|$ negative examples $\eta$. The algorithm works on the set of parameters
PAR := $[n, s, f, \varepsilon, c(k)]$ and the input IN := $[\sigma, POS, NEG]$, where $c(k)$ denotes
the logarithmic cooling schedule (9) of the underlying inhomogeneous Markov
chain.

The algorithm can be described as follows: From the $n$ literals defined by $\sigma$, an
initial hypothesis $t_{init} := x_1^{\sigma_1} \cdots x_n^{\sigma_n}$ is generated, which rejects all negative
examples. Then, we perform a simulated annealing procedure which is based on an
inhomogeneous Markov chain. Thus, the structure of our algorithm
term_approx(PAR,IN) approximating a $\lceil \log(s/\varepsilon) \rceil$-term can be described as shown
in Fig. 1. For our simulated annealing procedure, we have chosen the cooling
schedule (9), and we suppose that $\Gamma$ is a sufficiently large value (see (12)).

procedure term_approx(PAR,IN);
    k := 0; generate t_init := x_1^{σ_1} ··· x_n^{σ_n}; t := t_init;
    repeat
        Z(t) := l(t)/⌈log(s/ε)⌉;
        repeat
            generate uniformly i ∈ { 1, ... , n };
            if ( x_i^{σ_i} not in t ) then t' := t ∧ x_i^{σ_i}
            else t' := t \ x_i^{σ_i};
        until |{η : η ∈ NEG & t'(η) = 0}| ≥ f · |NEG|;
        Z(t') := l(t')/⌈log(s/ε)⌉;
        if Z(t') ≤ Z(t) then t := t'
        else
            c := c(k); generate uniformly ρ ∈ [0, 1];
            if exp((Z(t) − Z(t'))/c) ≥ ρ then t := t';
        if l(t) = l_0 goto FIN;
        k := k + 1;
    until k > k(ε, δ);
FIN: output t_fin := t.

Figure 1

From the definition of the learning procedure and Theorem 1 we obtain:

Theorem 2 The inhomogeneous Markov chain, which realises term_approx
and is based on (5) till (7) and (9), tends to the distribution $\lim_{k \to \infty} \sum_{t \in \mathcal{C}_\sigma^{l_0}} a_t(k) = 1$.
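The term search can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation: a term is represented as the set of variable indices whose literals agree with $\sigma$, the inner proposal loop is bounded (an added safeguard), and the parameter names and default values are assumptions.

```python
import math
import random


def rejected(term, sigma, neg):
    """Number of negative examples eta with t(eta) = 0."""
    return sum(any(eta[i] != sigma[i] for i in term) for eta in neg)


def term_approx(sigma, neg, l0, f=1.0, gamma=1.0, max_steps=5000, rng=None):
    rng = rng or random.Random(0)
    n = len(sigma)
    need = f * len(neg)
    term = set(range(n))                      # t_init uses all n literals of sigma
    for k in range(max_steps):
        if len(term) <= l0:                   # FIN: target length reached
            break
        cand = None
        for _ in range(10 * n):               # bounded variant of the inner repeat
            i = rng.randrange(n)
            trial = term ^ {i}                # add literal i if absent, else delete it
            if rejected(trial, sigma, neg) >= need:
                cand = trial
                break
        if cand is None:
            continue                          # no admissible neighbour found this step
        dz = (len(cand) - len(term)) / l0     # Z(t') - Z(t)
        c = gamma / math.log(k + 2)           # logarithmic cooling schedule (9)
        if dz <= 0 or math.exp(-dz / c) >= rng.random():
            term = cand
    return term
```

Because every literal is taken from $\sigma$, each candidate term satisfies $t(\sigma) = 1$ by construction, and only the rejection condition on NEG has to be tested.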

In our algorithm, we are searching for a representation by conjunctive terms of
length $l_0$, and therefore we obtain from the defining equation (4) of the objective
function:

$1 \le Z(t) \le n/\lceil \log(s/\varepsilon) \rceil$.    (10)

By definition (see (1) and (2)), any positive example $\sigma$ can be represented by a
conjunctive $l_0$-term, i.e., when the current hypothesis $t$ does not contain a subset
of $l_0$ literals rejecting $f \cdot |NEG|$ negative examples, the term $t$ has to be extended
by at most $l_0$ literals to a new hypothesis $t'$ satisfying the necessary condition.
Thus, we have

$Z(t') - Z(t) \le l_0/l_0 = 1$.    (11)

From $t'$, there exists a sequence of transitions to a term in $\mathcal{C}_\sigma^{l_0}$, which
represents $\sigma$, that decreases the length at any step. Therefore, according
to Definition 2 and Theorem 1, condition (iii), we obtain in our case the lower
bound

$\Gamma \ge 1$.    (12)



3.2 Approximations of DNF 

The procedure term_approx(PAR,IN) (see Fig. 1) has to be performed for a
polynomial number of positive examples. Since the outcome might be the same
for different $\sigma \in POS$, it remains only to collect all pairwise different conjunctive
terms.

The entire procedure dnf_approx(PAR,M,POS,NEG) can be described as
shown by the simple structure from Fig. 2. The new parameter $M$ defines the
number of repeated trials for a given $\sigma \in POS$. The aim is to enhance the number
of positive examples which are satisfied by $t_i$ in addition to $\sigma_i$, i.e., $M$ affects the
“multiplicity” of $t_i$ on POS. At the end of the procedure, only the terms with
the highest multiplicity which cover the entire set POS are taken for the final
hypothesis.
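The collection step can be illustrated with a short Python sketch. It is a simplified reading of the procedure under stated assumptions: `find_term` stands for any term search (for example a simulated annealing search) and is a hypothetical callable, terms are frozensets of (index, bit) literals, and of the `trials` runs per positive example only the term with the highest multiplicity is kept.

```python
def satisfies(term, a):
    """term is a frozenset of (index, bit) literals."""
    return all(a[i] == bit for i, bit in term)


def multiplicity(term, pos):
    """Number of positive examples satisfied by the term."""
    return sum(satisfies(term, a) for a in pos)


def dnf_approx(pos, neg, find_term, trials):
    best = {}                                   # term -> multiplicity on POS
    for sigma in pos:
        candidates = [find_term(sigma, neg) for _ in range(trials)]
        term = max(candidates, key=lambda t: multiplicity(t, pos))
        best[term] = multiplicity(term, pos)    # identical terms are stored once
    hypothesis, uncovered = [], list(pos)
    for term in sorted(best, key=best.get, reverse=True):
        if not uncovered:                       # POS is covered: stop adding terms
            break
        hypothesis.append(term)
        uncovered = [a for a in uncovered if not satisfies(term, a)]
    return hypothesis


# Toy usage with a stand-in term finder that keeps the first two literals.
def first_two(sigma, neg):
    return frozenset((i, sigma[i]) for i in range(2))


pos = [(1, 1, 0), (1, 1, 1), (0, 0, 1)]
neg = [(0, 1, 0)]
print(dnf_approx(pos, neg, first_two, trials=1))
```

Since each term returned for $\sigma$ satisfies $\sigma$ itself, adding terms in order of decreasing multiplicity always ends with the whole of POS covered.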

Based on Hajek’s theorem [10] (see Theorem 1), we can prove the following 
result about the convergence speed [3]: 

Theorem 3 Given the minimal length of conjunctive terms $l_0 = \lceil \log(s/\varepsilon) \rceil$, the
condition



k > log*^^^^ (1/^) + 

implies for $\delta > 0$ and arbitrary initial probability distributions $a(0)$ the relation

$\sum_{t \notin \mathcal{C}_\sigma^{l_0}} a_t(k) < \delta$, and therefore $\sum_{t' \in \mathcal{C}_\sigma^{l_0}} a_{t'}(k) > 1 - \delta$.

We note that the time complexity of basic operations performed in term_approx 
and dnf_approx can be estimated roughly by n^. 
4 Computational Experiments

Both procedures term_approx(PAR,IN) and dnf_approx(PAR,M,POS,NEG)
have been implemented and tested for a number of parameter settings. The
implementation was running on an AcerAltos 500 machine.

The target DNF $F$ was generated randomly for given values of $n$, $s$, and $\varepsilon$. The
value $\varepsilon = 1/n$ was chosen, and $l_0$ was calculated according to $l_0 = \lceil \log(s/\varepsilon) \rceil$;
see (1). Then, for $F$ and a given total number $m$ of examples, an equal number
of $m/2$ positive and $m/2$ negative examples was generated. We are presenting
the outcome of computational experiments for two values of $n$: $n = 32$, $n = 64$,
and $s$ taking three values 8, 16, and 32. The two values of $n$ and the number of
conjunctive terms are larger than the corresponding numbers considered, e.g.,
in [9] (see Section 4.3 there).

procedure dnf_approx(PAR,M,POS,NEG)
begin
    enumerate POS from 1 till p := |POS|; i := j := 1;
    H := H' := Empty DNF;
    repeat
        m := 0; j := 1;
        repeat
            t_i := term_approx(PAR,[σ_i,NEG]);
            m'(t_i) := |{σ | σ ∈ POS & t_i(σ) = 1}|;
            if m' > m then m := m';
            j := j + 1;
        until j > M;
        if ∀t ( t from H' → t_i ≠ t ) then H' := H' ∨ t_i;
        m(t_i) := m; i := i + 1;
    until i > p;
    repeat
        if ∀t' ( t, t' from H' → m(t) ≥ m(t') )
        then delete t in H' and H := H ∨ t;
    until ∀σ ∃t ( σ ∈ POS & t from H → t(σ) = 1 );
    output H;
end.

Figure 2

The hypotheses were computed using term_approx and dnf_approx. For 
each hypothesis H, additional m examples were generated randomly from the 
corresponding F (different from the examples used for the computation of the 
hypothesis) and H was evaluated on these sets of m/2 positive and m/2 negative 
examples. 
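The evaluation on fresh examples can be sketched as follows in Python. The source only states that the additional examples were generated randomly from $F$; drawing uniform tuples and sorting them into positive and negative sets until each holds $m/2$ examples is an assumption made here for illustration, as are the function names.

```python
import random


def dnf_value(dnf, a):
    """Value of a DNF given as a list of terms, each a set of (index, bit) literals."""
    return any(all(a[i] == bit for i, bit in term) for term in dnf)


def evaluate(hypothesis, target, n, m, rng):
    """Return the fractions of correctly classified positive and negative examples."""
    pos, neg = [], []
    while len(pos) < m // 2 or len(neg) < m // 2:
        a = tuple(rng.randint(0, 1) for _ in range(n))
        bucket = pos if dnf_value(target, a) else neg
        if len(bucket) < m // 2:
            bucket.append(a)
    acc_pos = sum(dnf_value(hypothesis, a) for a in pos) / len(pos)
    acc_neg = sum(not dnf_value(hypothesis, a) for a in neg) / len(neg)
    return acc_pos, acc_neg


target = [frozenset({(0, 1), (1, 1)})]                 # F = x_0 AND x_1
too_narrow = [frozenset({(0, 1), (1, 1), (2, 1)})]     # misses part of POS
print(evaluate(too_narrow, target, n=8, m=40, rng=random.Random(0)))
```

The two returned fractions correspond to the columns "F(σ) = 1" and "F(η) = 0" in the result tables: a hypothesis with overly long terms loses positive examples, while one with overly short terms accepts negative ones.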

4.1 The Case of f = 1 

The test procedure has been repeated up to five times, and the average
evaluation results are shown in Table 1 till Table 3 for the case of $f = 1$. As can
be seen from Table 1 and Table 2, the error on additionally generated examples
becomes smaller with an increasing number $M$ of repeated trials to enhance the
multiplicity of conjunctions from the hypothesis. However, the average
multiplicity increases only within a small margin even when $M$ is increased
fourfold. Furthermore, the correct classification increases significantly
in case of a larger number of examples per conjunctive term of the target DNF.

n = 32, s = 8, f = 1, ε = 1/n = 0.03125, l_0 = 9, |POS| + |NEG| = 512

M     Examples     Evaluation of H(σ) and H(η) for   Number       Average        Run-Time
      per Term     F(σ) = 1        F(η) = 0          of Terms     Multiplicity   (Seconds)
      of F                                           in H         of Terms
128   64           69.2%           93.4%             77           6.9            190
256   64           73.2%           92.4%             76           7.8            366
512   64           75.7%           95.1%             68           8.3            674

Table 1

A smaller number of conjunctive terms in $F$ and therefore a smaller $l_0$ (for
a fixed $\varepsilon$) implies an increase of the computation time because the search for a
shorter conjunction rejecting all elements of NEG becomes more difficult.
However, the number of accepted positive examples increases for shorter
conjunctions, i.e., the entire hypothesis becomes shorter which leads to better
classifications on NEG. With a decreasing average ratio of examples per conjunction of
the target DNF $F$, the classification becomes rapidly worse, although the average
multiplicity does not differ significantly. Therefore, to obtain a good
classification rate for large values of $n$ and $s$, a very large number of training examples is
required; see Table 3. On the other hand, it remains open to investigate the
performance of our algorithm on more specific data with a nonuniform generation
probability, i.e., if the training data carry more information about the structure
of the problem instance.

n = 32, s = 8, f = 1, ε = 1/n = 0.03125, l_0 = 9, |POS| + |NEG| = 1024

M     Examples     Evaluation of H(σ) and H(η) for   Number       Average        Run-Time
      per Term     F(σ) = 1        F(η) = 0          of Terms     Multiplicity   (Seconds)
      of F                                           in H         of Terms
128   128          89.8%           91.2%             102          10.9
256   128          89.6%           92.3%             96           12.5           1385
512   128          92.3%           95.8%             79           13.9           2820

Table 2

n = 64, s = 16, f = 1, ε = 1/n = 0.015625, l_0 = 11

|POS| +   M     Examples     Evaluation of H(σ) and H(η) for   Number       Average        Run-Time
|NEG|           per Term     F(σ) = 1        F(η) = 0          of Terms     Multiplicity   (Seconds)
                of F                                           in H         of Terms
                64           22.7%           92.0%             185          4.8            6131
                128          43.0%           89.3%             411          5.2            5927
4096      16    256          64.2%           79.5%             697          6.1            5712

Table 3

The algorithm has been applied to large problem instances as well, e.g., 
n = 1024, s = 128, m = 1024, but only for small values of M and therefore a 
small resulting average multiplicity. 

As can be seen from Table 1 till Table 3, the representation of the
information about the target function (i.e., $l_0$ multiplied by the number of terms of the
hypothesis $H$) is significantly smaller compared to the trivial disjunction of all
positive examples.

4.2 The Case of f < 1 

We performed a number of runs in order to investigate the impact of the test
t(NEG) = 0. Due to the uniform generation probability of negative examples, the
hypotheses derived from a positive example reject, with high likelihood, a large
fraction of all negative examples. The comparison with Table 1 shows that there



$n = 32$, $s = 8$, $M = 128$, $\varepsilon = 1/n = 0.03125$, $l_0 = 9$, |POS| + |NEG| = 512

f      Evaluation of $H(\sigma)$ and $H(\eta)$ for   Number of    Average        Run-Time
       $F(\sigma) = 1$     $F(\eta) = 0$             Terms in H   Multiplicity   (Seconds)
                                                                  of Terms

0.75   66.8 %              92.2 %                    89           6.8            178
0.50   71.5 %              89.8 %                    85           6.7            179
0.25   72.7 %              93.4 %                    92           6.6            215

Table 4



is no significant difference from the case when $f = 1$ is chosen, even for only 256
negative examples. The average rejection rate is very high and close to 256. We
think that an impact of the test t(NEG) = 0 might be expected for large values of
the ratio $n/l_0$, when the number of negative examples increases correspondingly.

In Table 5, the length $l_0 = \log n$ has been taken, i.e., $n/l_0 = 6.4$, whereas



$n = 32$, $s = 8$, $M = 128$, $l_0 = \log n = 5$, |POS| + |NEG| = 512

f      Evaluation of $H(\sigma)$ and $H(\eta)$ for   Number of    Average        Run-Time
       $F(\sigma) = 1$     $F(\eta) = 0$             Terms in H   Multiplicity   (Seconds)
                                                                  of Terms

0.75   94.1 %              66.1 %                    56           22.7           539
0.50   87.1 %              63.3 %                    68           23.9           491
0.25   85.9 %              59.4 %                    57           21.8           467

Table 5



$n/l_0 = 3.6$ in Table 4. As can be seen, the correct classification on untrained
negative examples becomes worse compared to the smaller ratio of Table 4. The
results on negative examples are expected to be worse also because of the short
length $l_0$ itself, not only because of the larger ratio $n/l_0$.

5 Concluding Remarks

A simulated annealing-based learning algorithm has been presented that
generates hypotheses about Boolean functions from positive and negative examples
alone. The algorithm utilises Hajek’s Theorem on the convergence of logarith- 
mic cooling schedules. The convergence speed of inhomogeneous Markov chains 
depends on the maximum value of the minimum escape depth from local min- 
ima of the underlying energy landscape. The algorithm has been implemented 
and tested on a number of small examples. Our computational experiments have 
shown that a relatively high percentage of correct classifications on additionally 
presented examples can be achieved, even in the case when misclassifications 
up to a certain degree are allowed on negative examples. The complexity of the 
hypotheses is significantly smaller than the trivial disjunction of all positive ex- 
amples. Further research will concentrate on nonuniform generation probabilities 
of examples, i.e., when more information is available about the structure of the 
problem instance. 
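
As a rough illustration of annealing under a logarithmic cooling schedule, the following Python sketch uses the form $c(k) = \Gamma / \ln(k + 2)$, one common way of writing such a schedule. The toy landscape, the function names and the value of `gamma` are assumptions made for this example; it is not the implementation evaluated in the tables above.

```python
import math
import random


def log_cooling(k, gamma):
    """Temperature at step k (k = 0, 1, 2, ...) for a logarithmic schedule."""
    return gamma / math.log(k + 2)


def anneal(energy, neighbours, start, gamma, steps, seed=0):
    """Simulated annealing with logarithmic cooling; returns the best state seen."""
    rng = random.Random(seed)
    state = best = start
    for k in range(steps):
        candidate = rng.choice(neighbours(state))
        delta = energy(candidate) - energy(state)
        # Downhill moves are always accepted; uphill moves are accepted with
        # probability exp(-delta / c(k)), which shrinks as c(k) decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / log_cooling(k, gamma)):
            state = candidate
        if energy(state) < energy(best):
            best = state
    return best


# Hypothetical toy landscape: energies of six states arranged on a line.
# The local minimum at index 1 (energy 1) can only be left by climbing to
# energy 4, i.e. an escape depth of 3, so gamma = 3.0 is used below.
LANDSCAPE = [3, 1, 2, 4, 0, 5]


def toy_energy(i):
    return LANDSCAPE[i]


def toy_neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


best = anneal(toy_energy, toy_neighbours, start=0, gamma=3.0, steps=500)
print(best, toy_energy(best))
```

The parameter `gamma` plays the role of the escape-depth bound mentioned above: the larger it is, the more slowly the temperature falls and the longer uphill moves remain likely.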

Abstract. This paper presents Mutant, a learning system for
autonomous agents. Mutant is an adaptive control architecture founded
on genetic techniques and reinforcement learning. The system allows an
agent to learn some complex tasks without requiring its designer to fully
specify how they should be carried out. An agent's behavior is defined
by a set of genetically encoded rules. The rules are evolved over time
by a genetic algorithm, which synthesizes new, better rules according
to their respective adaptive function, computed by progressive
reinforcements. The system is validated through an experiment in collective
robotics.

Keywords: control, genetic algorithms, reinforcement learning, multiagent
systems.



1 Introduction to Multiagent Learning 

1.1 The Viability Concept 

Learning theory in the multiagent domain essentially comes from artificial life.
Indeed, this emerging research area focuses on behavioral aspects of agent
architectures, including autonomy and self-adaptation, founded on the viability
problem. The notion of viability could be transposed, in the multiagent domain,
as a satisfaction function. Robots, animats or agents, in order to satisfy their
objectives during their evolution $f$, have to keep their state $\sigma$ within a
sphere of viability $K$ in such a way that $\sigma(t+1) = f(\sigma(t))$, to the
extent that $\forall \sigma,\ f(\sigma) \in K$.

Therefore, they need to be equipped with sensors and effectors suitable for the
constraints induced by their environment, as well as a control system responsible
for selecting a satisfying action to execute, at the right time, in accordance
with their goals. The sensors provide the agent with the capability to intercept
sensory information from its external environment or its internal state, while the
effectors afford the agent the means to act within its environment. The control
system coordinates perceptions and actions. The behavior of an agent may be
qualified as adaptive as long as its control system is able to keep $f(\sigma)$ in $K$.
Figure 1 represents a robot moving on a ground scattered with holes. Suppose
it has to learn to move in this environment by avoiding the holes. Its state
$\sigma(t)$ may be expressed in terms of two variables $\varepsilon_1(t)$ and
$\varepsilon_2(t)$ that vary over time. The sphere of viability $K$ may be
considered as the region of the state space where $\sigma$ should be kept. The
outside of $K$ corresponds to states that may endanger the survival or the goal
of this robot. The sensors of the robot should indicate that there is a hole in
front of it, and its control system then has to choose which action to
accomplish. Thus, at point P, $\sigma(t)$ risks leaving $K$ at
$\sigma'(t + \Delta t)$ if the agent moves towards the hole. An adaptive behavior
would be one that performs a corrective action on this state transition so as to
transform $\sigma$ into $\sigma(t + \Delta t)$, which is inside $K$, by going
round the hole.
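
The hole-avoidance example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hole position, the unit-step actions and all names (`HOLES`, `in_K`, `adaptive_step`) are hypothetical and do not come from the Mutant system.

```python
import math

# Hypothetical holes, each given as (centre, radius).
HOLES = [((2.0, 0.0), 0.5)]

# Hypothetical unit-step actions (dx, dy).
ACTIONS = {"forward": (1.0, 0.0), "left": (0.0, 1.0), "right": (0.0, -1.0)}


def in_K(state):
    """True when the state lies inside the sphere of viability K (not in a hole)."""
    return all(math.dist(state, centre) > radius for centre, radius in HOLES)


def f(state, action):
    """State transition: sigma(t + 1) = f(sigma(t)) under the chosen action."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def adaptive_step(state, preferred="forward"):
    """Apply the preferred action unless it would leave K; otherwise try a
    corrective action that keeps the next state inside K."""
    candidates = [preferred] + [a for a in ACTIONS if a != preferred]
    for action in candidates:
        nxt = f(state, action)
        if in_K(nxt):
            return action, nxt
    return None, state  # no viable action found: stay in place


print(adaptive_step((0.0, 0.0)))  # moving forward keeps the robot in K
print(adaptive_step((1.0, 0.0)))  # forward would enter the hole, so a turn is chosen
```

Here the corrective action is found by simply trying the alternatives in a fixed order; a learning system would instead have to discover which correction to apply.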




Fig. 1. an adaptive perception-control-action triad 



1.2 Hard-Wired Architectures

A preliminary approach to designing animats consists in coupling filtered sensors
directly to the appropriate action. The hard problem lies neither at the sensor nor
at the integration level, but at the level of arbitrating among potentially many
actions and selecting one [Bro91a]. In the majority of behavior-based systems
the solution is a built-in, fixed control hierarchy imposing a priority ordering on
the behaviors. It ensures that only one of the behaviors can have control over
the effectors, and that control is always given to the behavior with the
highest priority within the given context. Many of those are based on selecting
an action by computing a multivariable function implicitly encoded in behavior
activation levels. By spreading activation throughout the behavior networks and
using carefully chosen thresholds for triggering actions, action selection can be
tuned and even learned [Bro91b].

Nevertheless, the main weakness of this kind of architecture lies in its
inability to cope with unexpected situations, namely those which have not
been planned by the designer. However, it may be imperative to exhibit robust
behavior in complex and highly dynamic environments, and to be capable of
reacting in situations that have not been hard-wired. Furthermore, it may be
salutary to be inventive! It is obvious that fixed approaches do not provide such
capabilities. In addition, the design process often implies long adjustment phases
to identify, define, add and tune the skills required in the system.
Thus it may be very useful to rely on generic methods that can be adapted to
various domains and do not make any assumptions about the tackled problem. In
such a way, the design process can be significantly shortened by taking advantage
of the reusability of such methods. Moreover, multidisciplinary research
repeatedly involves similar but not identical designs to adapt a system from one
domain to another. A generic approach is then strongly recommended, especially
if we consider the increasing complexity of the tasks that autonomous systems have
to face nowadays, which increasingly rules out built-in approaches.



1.3 Reinforcement Learning 

The most widely used techniques for adaptive coordination of perception and
action in autonomous agent design are usually based on reinforcement learning.
Reinforcement is used by agents to improve their performance at reaching their
goals by practice or experience [Kae96]. A weight factor is assigned to each
condition-action pair of the agent's repertoire, and constitutes the discriminator
in the action selection process, since the behavioral control mechanism selects
the pair with an applicable condition that has the highest weight. The values
of the weights fluctuate over time, in accordance with some rewards and
punishments, reflecting a credit assignment to the current behavior coming from the
environment.

These techniques avoid freezing the coupling between the environmental
context and the way the agent may react. [Mat97] proposes a table-based learning
system which updates a condition-action reinforcement matrix over time. Each
entry $R(c, a)$ reflects the propensity for the action $a$ to be performed in the
context $c$. $R$ is computed as a heterogeneous reinforcement function combining
immediate and delayed payoffs. The immediate payoffs correspond to immediate
(un)satisfactions, whereas the delayed ones are computed from some progress
estimators. In this way, she manages to progressively establish the links between
suitable conditions and actions. Figure 2 illustrates this kind of learning system.
The reinforcement matrix can be expressed as a directed graph where each
edge $c_i \rightarrow a_j$ is weighted by the reinforcement value $R(c_i, a_j)$.
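
A minimal Python sketch of such a table-based scheme is given below. It is an illustration only: the class name, the plain sum used to combine immediate and progress payoffs, and the example conditions and actions are assumptions, not the formulation of [Mat97].

```python
from collections import defaultdict


class ReinforcementTable:
    """Condition-action table: R[(c, a)] is the propensity of action a in context c."""

    def __init__(self):
        self.R = defaultdict(float)

    def reinforce(self, condition, action, immediate, progress=0.0):
        # Assumption: the heterogeneous reinforcement is the plain sum of an
        # immediate payoff and a delayed payoff from a progress estimator.
        self.R[(condition, action)] += immediate + progress

    def select(self, active_conditions, actions):
        """Return the (condition, action) pair with the highest weight among
        the pairs whose condition currently holds."""
        pairs = [(c, a) for c in active_conditions for a in actions]
        return max(pairs, key=lambda pair: self.R[pair])


table = ReinforcementTable()
table.reinforce("puck_visible", "forward", immediate=1.0, progress=0.5)
table.reinforce("puck_visible", "wander", immediate=-1.0)
print(table.select(["puck_visible"], ["wander", "forward"]))
```

Note that the table holds one entry per condition-action pair, so its size grows with the product of the number of conditions and the number of actions.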

One famous extension of this kind of reinforcement is known
as Q-learning [WD92]. The Q-learning algorithm is a type of reinforcement
learning in which the value of taking each possible action $a$ in each situation $c$ is
represented as a utility function $Q(c, a)$. Watkins proved that this algorithm converges
to the optimal policy under the conditions that all $Q(c, a)$ values are stored
in a table, and the learning set includes an infinite number of episodes for each
state $c$ and action $a$.
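
For reference, the usual tabular Q-learning update can be sketched as follows. The learning rate `alpha`, the discount factor `gamma` and the state and action names are illustrative assumptions; the text above does not give parameter values.

```python
from collections import defaultdict


def q_update(Q, c, a, reward, c_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(c, a) <- Q(c, a) + alpha * (reward + gamma * max_b Q(c_next, b) - Q(c, a))."""
    best_next = max(Q[(c_next, b)] for b in actions)
    Q[(c, a)] += alpha * (reward + gamma * best_next - Q[(c, a)])
    return Q[(c, a)]


Q = defaultdict(float)            # every Q(c, a) starts at 0
actions = ["go", "stay"]
print(q_update(Q, "s0", "go", reward=1.0, c_next="s1", actions=actions))
```

The table `Q` has one entry per situation-action pair, which is what makes the approach sensitive to the size of the search space.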

This kind of approach is severely limited by the size of the search space.
Indeed, the learning system has to maintain the weight of all potential
combinations between each condition $c$ and each action $a$. Then, in order to reduce
the complexity of the search space, both the implemented conditions and actions
must be of a high abstraction level. Additionally, it may be crucial to trigger
an action only when several of the implemented conditions hold together. In this
case, the designer must add new entries in the matrix for new conditions that
implement the combinations of some others. Considering all possible
combinations, it seems obvious that the amount of knowledge the agent must
update over time may become very large. Nevertheless, large parts of it
progressively become useless as the learning advances.

Fig. 2. reinforcement learning

Consequently, we propose an adaptive control architecture able to shape the
agent behavior according to its environmental context. Instead of being
omniscient and rapidly swamped by the size of the search space, like that of Mataric,
our architecture has been designed to handle a reasonable amount of knowledge
at any one time, while still being able to broaden its knowledge by exploring
the search space. This architecture, named Mutant, is founded on the selective
exploration mechanisms of genetic techniques and is detailed in the next section.

2 Mutant: A Cyber(ge)netic Model

2.1 Genes of the Behavior 

An important stage in autonomous agent design consists in listing the set of
basis behaviors to implement. A basis behavior set should contain only behaviors
that are necessary, in the sense that each either achieves, or helps achieve, a
relevant goal that cannot be achieved with other behaviors in the set and cannot
be reduced to them. Furthermore, a basis behavior set should be sufficient
for accomplishing the goals in a given domain, so that no other basis behaviors are
necessary. The construction of such a set depends heavily on the tackled domain,
and furthermore on both the perceptual and active capabilities of the agents,
namely on the kind of sensors and effectors they are equipped with.

On the other hand, we have to keep in mind that the lack of accurate and reliable
sensors is arguably the most common complaint of researchers in situated agent
control and learning. In robotics, in particular, sensors have been targeted as one
of the limiting factors in the way of progress towards more complex autonomous
behavior. Most of the commonly used sensors provide noisy data and are difficult
to accurately characterize and model, presenting a major challenge for real-time
robot learning [Mat97]. These constraints must be considered in order to provide
researchers with reusable techniques. That is why we assume, in our work, that
agents are provided with sensors that only perceive low-level information, such as
distance to objects or contact with obstacles. This may be considered
a heavy disadvantage, especially from the multiagent perspective, where the
ability to sense and correctly distinguish members of one's group from others,
from obstacles and from various features in the environment is crucial for most
tasks. Actually, the lack of sophisticated perceptual discrimination is a critical
limitation in multi-robot work. Non-visual sensors such as infra-red, contact
sensors and sonars are all of limited use in the social recognition task. This kind
of difficulty is termed the hidden state problem [WB91]. If viewed as a form
of sensing, communication in multiagent systems can be used to effectively deal
with the hidden state problem. Like other sensors, radio transceivers perceive
signals and pass them on for further processing [Mat98]. Specific properties
of sensors vary greatly, and these differences can be usefully exploited: some
information that is very difficult to sense directly can be easily communicated.

In any case, when a sensor catches a signal from the environment, the resulting
quantum of information is digitized in a raw state. As shown in figure 3, it
must be filtered and read with others, then compiled into high-level informative
cells. These cells constitute the interpretable information for the agent. Actually,
they play the role of exciting elements able to solicit the control system
and to trigger actions. In other words, they play the role of stimuli. The set of
such stimuli thus constitutes the substrate of prerequisite conditions for action
triggering in the behavior control system.



[Figure 3; surviving label: low-level information]

Fig. 3. from perception to stimuli synthesis



Once the set of relevant stimuli has been identified, the basis action set
can be determined in turn. The actions to implement strongly depend on the
available effectors. In contrast to sensors, the effectors allow the agent to act
on the environment. Thus, basis actions should trigger the effectors by sending them
some suitable impulse signals. It is the sum of such simultaneous impulses
which conditions the resulting behavior of the agent. If the control system of
a robot sends uniform impulses both to its engine and to its rotator, then the
robot will move along an arc of a circle.

The sets of stimuli and basis actions thus constitute the repertoire of
elementary bricks available to build the global behavior of the agent. In other
words, these bricks are the elementary units of its behavioral program, exactly
as the genes are the bricks of the genetic program of living beings. The simple
combination of these primitives then allows an agent to develop some complex
behavior.

The main difficulty is how to connect these genes. How can some sensory genes
be associated with the suitable actions, and how can these
stimuli-action pairs be coordinated in a good policy, and in an adaptive way? To
address these issues, Mutant is defined as a genetic classifier system.

Classifiers were introduced by John Holland as the theoretical foundations
of learning with genetic algorithms. They are programs able to learn rules
to optimize the performance of a system or reinforce its autonomy. A genetic
classifier is composed of three main elements: a rules system, a credit assignment
system and a genetic algorithm [Hol92]. Both the rules and the credit assignment
systems are domain dependent, whereas the genetic algorithm is generic.
The efficiency of a genetic algorithm is not magic. John Holland has proposed
an interpretation based on the notion of schemata. A schema is a mathematical
object able to express the similarities between genetic codes. It is a genetic
pattern. Individuals, be they good (adapted) or not, are generally unstable, since
they may disappear from one generation to another under the application of the
genetic crossing-over. A good algorithm will thus be one that saves the best
schemata, generation by generation. Consequently, schemata represent the
substance on which a genetic algorithm works. Its efficiency depends essentially on
the choice of an encoding that exhibits some schemata that are not destroyed by
recombinations but, on the contrary, progressively improved from one generation
to the next.

Mutant is thus composed of:

- a set of behavioral rules, genetically encoded as chromosomes,

- a system of rules activation,

- a genetic algorithm to set up a competition between the rules,

- and a credit assignment system to measure the adaptive function of each rule.

Let us focus, in the next section, on the genetic encoding of behavioral rules. 



2.2 The Chromosome: A Genetic Encoding of the Behavior 

Each rule $R_i$ is defined as a combination of a predicate $P_i$ and an action $A_i$ in
such a way that, if $P_i$ is true, then $A_i$ is considered applicable. A predicate $P_i$
is a multivariable boolean function that combines the stimuli perceived from
the environment in a tree where nodes are boolean operators and leaves are
stimuli. In this way, complex environmental situations may be expressed,
even from a reduced set of recognizable stimuli. When stimuli are sensed from
the environment, they are propagated through the boolean tree, from the leaves,
and filtered by the boolean operators so as to be collected as a single boolean
value at the root of $P_i$. This value then makes the corresponding action $A_i$
applicable or not.

Fig. 4. the genetic structure of behavioral rules

The action $A_i$ corresponds to a basis skill of the agent, namely a behavioral
primitive. The implementation of this primitive indicates which effectors of the
agent should be activated, and how. For instance, if the action for a robot is to
turn around, then both its engine and its rotator will be activated with
uniform signals. We argue that such a set of rules is sufficient to express a wide
variety of complex behaviors by combination.
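
The rule structure just described (a boolean tree over stimuli plus an action) might be represented as in the following Python sketch. The operator set (`and`, `or`, `not`), the class names and the example rule are assumptions chosen for illustration; the text does not list the operators actually used in Mutant.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Stimulus:
    name: str                      # leaf of the predicate tree


@dataclass(frozen=True)
class Op:
    kind: str                      # "and", "or" or "not" (assumed operator set)
    args: Tuple["Node", ...]       # child subtrees


Node = Union[Stimulus, Op]


@dataclass(frozen=True)
class Rule:
    predicate: Node                # P_i
    action: str                    # A_i


def evaluate(node, stimuli):
    """Propagate sensed stimuli from the leaves up to a single boolean at the root."""
    if isinstance(node, Stimulus):
        return bool(stimuli.get(node.name, False))
    values = [evaluate(child, stimuli) for child in node.args]
    if node.kind == "and":
        return all(values)
    if node.kind == "or":
        return any(values)
    if node.kind == "not":
        return not values[0]
    raise ValueError(f"unknown operator: {node.kind}")


# Hypothetical rule: move forward when a puck is visible and the base is not close.
rule = Rule(
    predicate=Op("and", (Stimulus("is_puck_visible"),
                         Op("not", (Stimulus("is_close_to_base"),)))),
    action="move_forward",
)
sensed = {"is_puck_visible": True, "is_close_to_base": False}
print(rule.action, "applicable:", evaluate(rule.predicate, sensed))
```

Because the predicate is a tree, subtrees can later be exchanged or replaced without touching the action part of the rule.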

The structure of these rules is shown in figure 4.

$P_i$ constitutes a necessary precondition for the application of the rule $R_i$.
However, we will see in the next section that, even if it is a necessary condition
for the actual triggering of actions, this notion of applicability is not sufficient,
and the system of rules activation requires further features.

2.3 The System of Rules Activation 

Each rule $R_i$ is provided with two factors for the determination of its activation:

- an activation level $\alpha_i$ that represents the rule's propensity to be activated,

- an inhibiting threshold $\sigma_i$ that determines whether $R_i$ should actually be
activated or not, depending on the value of $\alpha_i$ against $\sigma_i$: if
$\alpha_i > \sigma_i$ then $R_i$ is activated, and inhibited otherwise.

At each cycle, the system of rules activation collects the set $R_{ap}$ of applicable
rules:

$R_{ap} = \{R_i : P_i \rightarrow A_i, \text{ where } P_i \text{ is true}\}$

Among these rules, it then determines the set $R_{ac}$ of rules that should be
activated:

$R_{ac} = \{R_i \in R_{ap} \text{ where } \alpha_i > \sigma_i\}$

Each rule $R_i$ of $R_{ac}$ is then activated, that is to say that the activation system
triggers the associated action $A_i$, and the others are inhibited. Thus, unlike
table-based learning systems such as those presented in section 1.3, which choose a
single action to perform at each behavioral cycle, Mutant is intrinsically parallel
and able to trigger several actions in the same cycle. The parallel triggering of
actions $A_i$ has the effect of concurrently stimulating the associated effectors
in the same cycle. The resulting impulse sent to each effector is thus a
combination of all the impulses stemming from the execution of the actions.

If no rule can be activated, the activation system chooses
a random rule to activate.

When a rule is activated, its activation level $\alpha_i$ is decreased by a factor
$d\alpha_i$. In this way, a single rule cannot be triggered indefinitely, and gives
the others the opportunity, after a sufficient time, to be activated in their turn.
This corresponds, in a way, to an accustoming phenomenon that progressively
inhibits the response to the same stimulus.

Symmetrically, the $d\alpha_i$ factor is split up into equal parts $d\alpha_j$ and shared
among all the inhibited rules $R_j \notin R_{ac}$. This phenomenon corresponds to a
progressive increase in the motivation of inactive rules. Accustoming and motivation
are thus opposite phenomena that regulate the equilibrium between the rules in
the activation process. Although it stems from different considerations, this
algorithm bears some likeness to the Bucket Brigade, proposed by Holland
in [Hol92].
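
One activation cycle, as described in this section, can be sketched in Python as follows. Several points are assumptions made for the sake of a runnable example: the fallback rule is drawn among the applicable rules when there are any (the text only says "a random rule"), every activated rule loses the same amount `d_alpha`, and the total amount removed is shared equally among all non-activated rules.

```python
import random
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    applicable: bool   # value of the predicate P_i in the current cycle
    alpha: float       # activation level
    sigma: float       # inhibiting threshold


def activation_cycle(rules, d_alpha=0.1, rng=random):
    """Run one cycle and return the names of the activated rules."""
    applicable = [r for r in rules if r.applicable]             # R_ap
    active = [r for r in applicable if r.alpha > r.sigma]       # R_ac
    if not active and rules:
        # Assumption: fall back on a random applicable rule, or on any rule
        # when none is applicable.
        active = [rng.choice(applicable or rules)]
    active_ids = {id(r) for r in active}
    inhibited = [r for r in rules if id(r) not in active_ids]
    for r in active:
        r.alpha -= d_alpha                                      # accustoming
    if inhibited:
        share = d_alpha * len(active) / len(inhibited)
        for r in inhibited:
            r.alpha += share                                    # motivation
    return [r.name for r in active]


rules = [
    Rule("push_puck", applicable=True, alpha=1.0, sigma=0.5),
    Rule("wander", applicable=True, alpha=0.2, sigma=0.5),
    Rule("turn", applicable=False, alpha=0.9, sigma=0.1),
]
print(activation_cycle(rules, d_alpha=0.2))
print([(r.name, round(r.alpha, 3)) for r in rules])
```

Under these assumptions the sum of the activation levels is conserved from one cycle to the next, which is one way of reading the "equilibrium between the rules" mentioned above.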



2.4 The Genetic Algorithm 

The genetic algorithm is used to impose an evolutionary competition among the
rules, and to synthesize some new rules from the best ones available. Actually,
each rule $R_i$ is weighted by an adaptive function $W_i$ that indicates its efficiency for
the agent evolution. In other words, $W_i$ reflects the propensity of $R_i$ to maintain
the state of the agent within its sphere of viability (see section 1.1). This function
is computed as the difference between $\alpha_i$ and $\sigma_i$:

$W_i = \alpha_i - \sigma_i$

Initially, each agent is provided with a set of random rules, where each factor
is randomly generated. At each genetic cycle, the genetic algorithm selects the
best adapted rules (those with the highest weights). These rules are proposed to the
reproduction system to generate clones that will replace the bad
rules. Then, in accordance with the rank selection method proposed in [Bak85],
which determines the number of clones that each rule should provide, the rules
proposed for reproduction are cloned according to their respective weights.
The new set of rules is then processed by the genetic operators: crossing-over
and mutation.

The crossing-over operator is applied to the respective predicates $P_i$ and $P_j$
of two rules $R_i$ and $R_j$ with the same action part $A$. It extracts two subtrees
$\pi_i$ and $\pi_j$ from $P_i$ and $P_j$ and swaps them. The number of pairs $[R_i, R_j]$
which participate in a crossing-over depends on a crossing rate $T_c$.

The mutation operator produces some random variations on the
selected rules in order to avoid premature convergence. The operator works by
randomly selecting a subtree in the predicate part of a rule, and replacing it
with another subtree, generated by a stochastic process, whose depth is controlled
by a limiting factor.
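
The two operators can be illustrated on predicate trees written as nested tuples. This is a sketch under assumptions: the tuple encoding, the stimulus names, the operator set and the 0.3 leaf probability are all hypothetical choices, not parameters of Mutant.

```python
import random

# Hypothetical stimuli used as leaves.
STIMULI = ["is_puck_visible", "is_base_visible", "is_close_to_puck"]


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every subtree, root included."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def put(tree, path, new):
    """Return a copy of tree in which the subtree at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], new),) + tree[i + 1:]


def size(tree):
    return sum(1 for _ in paths(tree))


def random_tree(rng, max_depth):
    """Stochastic subtree generator whose depth is bounded by max_depth."""
    if max_depth <= 0 or rng.random() < 0.3:
        return rng.choice(STIMULI)
    op = rng.choice(["and", "or", "not"])
    arity = 1 if op == "not" else 2
    return (op,) + tuple(random_tree(rng, max_depth - 1) for _ in range(arity))


def crossover(p1, p2, rng):
    """Swap one randomly chosen subtree of p1 with one of p2."""
    a = rng.choice(list(paths(p1)))
    b = rng.choice(list(paths(p2)))
    return put(p1, a, get(p2, b)), put(p2, b, get(p1, a))


def mutate(p, rng, max_depth=2):
    """Replace one randomly chosen subtree by a freshly generated one."""
    target = rng.choice(list(paths(p)))
    return put(p, target, random_tree(rng, max_depth))


rng = random.Random(1)
p1 = ("and", "is_puck_visible", ("not", "is_base_visible"))
p2 = ("or", "is_close_to_puck", "is_base_visible")
print(crossover(p1, p2, rng))
print(mutate(p1, rng))
```

Since tuples are immutable, both operators return new trees and leave the parents untouched, which makes it easy to keep the parents in the population if desired.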

Figure 5 illustrates these two genetic operators. The genetic
paradigm used here was introduced by John Koza and is better
known as genetic programming [Koz89].







Fig. 5. the genetic operators 

The final step in designing the adaptive control system concerns credit
assignment. This is the process by which the rules may be rewarded or punished
in accordance with the benefits they bring to the agents.



2.5 The Credit Assignment System 

The credit assignment system of Mutant is a reinforcement algorithm. A potential
problem in traditional reinforcement learning is that the payoffs are
generally delayed. The agent must successfully complete a sequence of steps before
receiving a reward. This makes credit assignment in the intervening steps
more difficult. To address this issue, we propose an alternative reward scheme in which
the agent is given intermediate rewards as it carries out the task.

We consider that, rather than coming from the environment, as implemented
in many programs, the reinforcement must be intrinsically deduced by the agent
itself, from satisfaction or disappointment indicators. It should then be able to
estimate progress, stagnation or regression after a
state transition, in accordance with its goals. This evaluation is highly domain
dependent. For instance, a robot trying to reach a target in the environment will
evaluate an increase if it gets closer to this target, a decline if it moves away
from it, and a stagnation otherwise.

The difficulty for the designer is thus to identify the set of reinforcements
associated with each relevant event that may indicate a state variation that is useful
for the improvement of the behavior. In general, if $M(t)$ is a measure of the
state at time $t$, and $r_i(t)$ the payoff attributed at time $t$ to the rule $R_i$ triggered
at time $t - 1$, then:

- $r_i(t) > 0$ iff $M(t) > M(t-1)$

- $r_i(t) < 0$ iff $M(t) < M(t-1)$

- $r_i(t) = 0$ otherwise.

Actually, this payoff directly corresponds to a variation of the inhibiting
threshold $\sigma_i$ of $R_i$ as follows:

$\sigma_i(t) = \sigma_i(t-1) - r_i(t)$

In this way, if the agent receives a positive payoff, then the inhibiting threshold
of the rule responsible for the state transition between times $t - 1$ and $t$ is
decreased. This phenomenon corresponds to an increase of the motivation of
the agent to perform this rule in the same conditions. Conversely, if the
agent receives a negative payoff, then the inhibiting threshold of the active rule
is increased, in order to decrease the motivation of the agent to perform its last
action in the same conditions.

The fluctuation in thresholds $\sigma_i$ then causes the fluctuation in weights $W_i$.
This guarantees the competition between the rules over time, which is the sine
qua non feature for the adaptability of the agent behavior.
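
The threshold update and its effect on the weight can be checked with a small Python sketch. The proportional payoff (with a hypothetical `gain` parameter) is an assumption, since the text only fixes the sign of $r_i(t)$, and the weight is taken here as $\alpha_i - \sigma_i$, reading "the difference between" the two factors in that order.

```python
def payoff(m_now, m_prev, gain=1.0):
    """Payoff r_i(t): positive on progress, negative on regression, zero otherwise.
    Assumption: proportional to the variation of the measure M."""
    return gain * (m_now - m_prev)


def update_threshold(sigma, r):
    """sigma_i(t) = sigma_i(t - 1) - r_i(t)."""
    return sigma - r


def weight(alpha, sigma):
    """Adaptive function W_i, taken here as alpha_i - sigma_i."""
    return alpha - sigma


alpha, sigma = 0.25, 0.5
r = payoff(m_now=3.0, m_prev=2.0)          # the agent made progress
new_sigma = update_threshold(sigma, r)
print(weight(alpha, sigma), "->", weight(alpha, new_sigma))
```

A positive payoff lowers the threshold and therefore raises the weight of the rule, so rewarded rules become both easier to activate and more likely to be selected for reproduction.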

3 Experiments and Results

We chose, in the scope of this paper, to experiment with the Mutant architecture in
the collective robotics domain, through a system of foraging robots. We have thus
developed a simulator for autonomous robots. This simulator has been implemented
using Jaafaar, a kernel we have developed for generic multiagent simulation.



Table 1. parameters of the simulation

stimuli                 actions          payoffs   events
is_base_visible?        wander            1        puck delivered
is_puck_visible?        move_forward      1        puck picked-up
is_close_to_puck?       turn_on_left     -3        puck dropped outside
is_close_to_base?       turn_on_right     0.5      puck pushed to base
is_puck_within_base?                     -0.5      puck pushed away from base

[Figure 6 panels: "at the beginning" (labels: robot, pucks; the pucks are scattered
in the environment) and "in progress..." (the robots have learned an efficient
behavior)]

Fig. 6. simulation snapshots



Four robots are equipped with a vision sensor, able to discriminate a home
region (their base) and some pucks scattered in the environment. Each robot
is also equipped with both an engine to move forward and a rotator to turn.
The task this group of robots has to learn is to wander in the environment until
finding a puck, then to push the puck into the base. At the beginning of the
simulation, the set of behavioral rules of each robot is generated randomly, that
is to say that their initial knowledge is totally unreliable. Table 1 lists the
parameters of the simulation: the set of basis behaviors, the payoffs and the
corresponding events. Figure 6 shows some snapshots of the simulation, and
figure 7 presents how performance improves as the agents learn over time (about
1000 simulations have been averaged). The sudden drops in the curves are a
parasitic phenomenon due to the fact that when agents have delivered a puck to
the base, they may push the others outside because they have not been provided
with a behavior to avoid this.

Fig. 7. simulation results

Abstract. Computing least common subsumers in description logics is
an important reasoning service useful for a number of applications. As
shown in the literature, this reasoning service can be used for the
approximation of concept disjunctions in description logics, the “bottom-up”
construction of knowledge bases, learning tasks, and for specific kinds of
information retrieval. So far, computing the least common subsumer has
been restricted to description logics with rather limited expressivity. In
this article, we continue recent research on extending this operation to
more complex languages and present a least common subsumer operator
for the expressive description logic $\mathcal{ALENR}$.



1 Introduction 

Knowledge representation languages based on description logics (DLs) have 
proven to be a useful means for representing the terminological knowledge of 
an application domain in a structured and formally well understood way [6]. 
In DLs, knowledge bases are formed out of concepts representing sets of
individuals. Complex concepts are built out of atomic concepts and atomic roles
(representing binary relations between individuals) using the concept
constructors provided by the DL language. For example, the set of grandmothers can
be described using the atomic concepts woman and parent and the atomic role
has-child: woman $\sqcap$ $\exists$ has-child.parent.

A central feature of knowledge representation systems based on DLs is a 
set of reasoning services with the ability to deduce implicit knowledge from 
explicitly represented knowledge. For instance, the subsumption relation between 
two concepts can be determined. Intuitively speaking, a concept C subsumes a 
concept D if the set of individuals represented by C is a superset of the set of 
individuals represented by D, i.e., if C is more general than D. 

As another reasoning service, the least common subsumer (LCS) operation,
applied to concepts C and D, computes the most specific concept (from the
infinite set of all concepts) which subsumes C and D. The LCS is an important
reasoning service useful for a number of applications. Cohen et al. consider an
LCS operator for learning tasks [4] and in order to approximate a disjunction
operator in the DL $\mathcal{ALN}$ with feature chain equality, because the disjunction
operator is not explicitly present. Baader et al. use the LCS for the “bottom-up”
construction of KBs based on the DLs $\mathcal{ALN}$ [1] and $\mathcal{ALE}$ [2]. In [8], Möller et al.
apply the operator to similarity-based information retrieval, where the LCS
formalizes the notion of “similarities” of concepts. As recent literature shows, there
is a tendency to extend this reasoning service to more and more expressive DL
languages.

The main contribution of this paper is the proposal of an LCS operator for the
expressive DL $\mathcal{ALENR}$, which consists of the top and bottom concepts,
atoms, negations of atoms, concept conjunctions, $\exists$- and
$\forall$-quantifications, role conjunctions, and number restrictions. The
constructors offered by $\mathcal{ALENR}$ have proven to be useful in our
similarity-based information retrieval applications.

Cohen, Borgida, and Hirsh [3] showed that if a unique least upper bound
operation on the set of arguments of each concept forming operator is available
and subsumption between two concepts can be computed structurally, their LCS
can be determined by a simple recursive algorithm. The first task can be
accomplished rather easily, whereas considerable effort is necessary to
transform an $\mathcal{ALENR}$ concept into structural subsumption normal form.
Therefore, the main objective of this article is to give a definition of such a
normal form and to provide an algorithm to compute it for $\mathcal{ALENR}$
concepts. Technical details of the presented work, including complete proofs,
are available in [7].

2 Preliminaries 

In this section, we review the definition and some properties of the DL
$\mathcal{ALENR}$ (e.g., considered in [5,6]).

Definition 1. Let $\mathcal{C}$ be a set of atomic concepts and $\mathcal{R}$ a set of
atomic roles disjoint from $\mathcal{C}$. ($\mathcal{ALENR}$-)concepts are recursively
defined as follows:

- The symbols $\top$ and $\bot$ are concepts (top concept, bottom concept).

- $A$ and $\neg A$ are concepts for each $A \in \mathcal{C}$ (atomic concept, negated
atomic concept).

- If $C$ and $D$ are concepts, $R \in \mathcal{R}$ is an atomic role, and
$n \in \mathbb{N} \cup \{0\}$, then

• $C \sqcap D$ (concept conjunction),

• $\exists R.C$ (existential role quantification),

• $\forall R.C$ (universal role quantification),

• $(\geq n\, R)$ ($\geq$-restriction), and

• $(\leq n\, R)$ ($\leq$-restriction)
are also concepts.

- If $R$ and $S$ are roles, then $R \sqcap S$ is a role (role conjunction).

A subexpression of a concept C is a substring of C that qualifies as a concept. 
The semantics of a concept is defined in terms of an interpretation. 

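Definition 2. An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ of a concept
consists of a non-empty set $\Delta^{\mathcal{I}}$ (the domain of $\mathcal{I}$) and an
interpretation function $\cdot^{\mathcal{I}}$. The interpretation function maps every atomic
concept $A$ to a subset $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$ and every role $R$ to a subset
$R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$. The interpretation function is recursively
extended to a complex concept as follows. Assume that $A^{\mathcal{I}}$, $C^{\mathcal{I}}$, $D^{\mathcal{I}}$,
$R^{\mathcal{I}}$, and $S^{\mathcal{I}}$ are already given and $n \in \mathbb{N} \cup \{0\}$. Then

- $\top^{\mathcal{I}} := \Delta^{\mathcal{I}}$, $\bot^{\mathcal{I}} := \emptyset$, $(\neg A)^{\mathcal{I}} := \Delta^{\mathcal{I}} \setminus A^{\mathcal{I}}$,

- $(C \sqcap D)^{\mathcal{I}} := C^{\mathcal{I}} \cap D^{\mathcal{I}}$, $(R \sqcap S)^{\mathcal{I}} := R^{\mathcal{I}} \cap S^{\mathcal{I}}$,

- $(\exists R.C)^{\mathcal{I}} := \{a \in \Delta^{\mathcal{I}} \mid \exists b : (a, b) \in R^{\mathcal{I}} \wedge b \in C^{\mathcal{I}}\}$,

- $(\forall R.C)^{\mathcal{I}} := \{a \in \Delta^{\mathcal{I}} \mid \forall b : (a, b) \in R^{\mathcal{I}} \Rightarrow b \in C^{\mathcal{I}}\}$,

- $(\geq n\, R)^{\mathcal{I}} := \{a \in \Delta^{\mathcal{I}} \mid \sharp\{b \mid (a, b) \in R^{\mathcal{I}}\} \geq n\}$, and

- $(\leq n\, R)^{\mathcal{I}} := \{a \in \Delta^{\mathcal{I}} \mid \sharp\{b \mid (a, b) \in R^{\mathcal{I}}\} \leq n\}$.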

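An interpretation $\mathcal{I}$ is a model of a concept $C$ iff $C^{\mathcal{I}} \neq \emptyset$. If $C$
has a model, $C$ is called consistent. A role $R$ is a subrole of a role $S$ iff,
for all interpretations $\mathcal{I}$, $R^{\mathcal{I}} \subseteq S^{\mathcal{I}}$. A concept $C$ is subsumed by a
concept $D$ ($C \sqsubseteq D$) iff $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$ holds for all interpretations
$\mathcal{I}$ of $C$ and $D$. $C$ is equivalent to $D$ ($C \equiv D$) iff
$C \sqsubseteq D \wedge D \sqsubseteq C$.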
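
To make the semantics concrete, the following Python sketch evaluates a concept over a small finite interpretation. The tuple encoding of concepts, the representation of a role conjunction as a non-empty `frozenset` of atomic role names, and the names `ext`, `role_ext`, and `successors` are assumptions made for this illustration only; they are not part of the formalism.

```python
def role_ext(role, roles):
    """Extension of a role conjunction: intersection of its atomic role extensions."""
    exts = [set(roles.get(name, set())) for name in role]
    return set.intersection(*exts)  # assumes `role` is non-empty


def successors(a, role, roles):
    """All b such that (a, b) belongs to the extension of `role`."""
    return {b for (x, b) in role_ext(role, roles) if x == a}


def ext(concept, domain, atoms, roles):
    """Extension of `concept`, following the clauses of Definition 2."""
    kind = concept[0]
    if kind == "top":
        return set(domain)
    if kind == "bot":
        return set()
    if kind == "atom":
        return set(atoms.get(concept[1], set()))
    if kind == "not":
        return set(domain) - atoms.get(concept[1], set())
    if kind == "and":
        result = set(domain)
        for part in concept[1:]:
            result &= ext(part, domain, atoms, roles)
        return result
    if kind == "some":   # ("some", role, filler)
        filler = ext(concept[2], domain, atoms, roles)
        return {a for a in domain if successors(a, concept[1], roles) & filler}
    if kind == "all":    # ("all", role, filler)
        filler = ext(concept[2], domain, atoms, roles)
        return {a for a in domain if successors(a, concept[1], roles) <= filler}
    if kind == "atleast":  # ("atleast", n, role)
        return {a for a in domain if len(successors(a, concept[2], roles)) >= concept[1]}
    if kind == "atmost":   # ("atmost", n, role)
        return {a for a in domain if len(successors(a, concept[2], roles)) <= concept[1]}
    raise ValueError(f"unknown constructor: {kind}")


# Hypothetical interpretation for the grandmother example of the introduction.
domain = {"mary", "ann", "tom"}
atoms = {"woman": {"mary", "ann"}, "parent": {"ann"}}
roles = {"has-child": {("mary", "ann"), ("ann", "tom")}}
HC = frozenset({"has-child"})
grandmother = ("and", ("atom", "woman"), ("some", HC, ("atom", "parent")))
```

In this interpretation only `mary` has a child who is a parent, so `ext(grandmother, domain, atoms, roles)` contains exactly that individual.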

Note that both constructors $\top$ and $\bot$ are expressible by $(\geq 0\, R)$ and
$A \sqcap \neg A$, respectively. For some explanations of the algorithms presented
subsequently, we introduce the concept depth.

Definition 3. The depth of a concept $C$ is recursively defined over its structure.

- If $C = \exists R.C'$ or $C = \forall R.C'$, then $\mathit{depth}(C) = 1 + \mathit{depth}(C')$.

- If $C = C_1 \sqcap \cdots \sqcap C_n$, then $\mathit{depth}(C) = \max(\{\mathit{depth}(C_i) \mid 1 \leq i \leq n\})$.

- In all other cases, $\mathit{depth}(C) = 0$.
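
The recursion of Definition 3 can be sketched in a few lines of Python. The tuple encoding of concepts (for instance `("some", role, filler)` for an existential quantification) is an assumption of this illustration, not notation used in the paper.

```python
def depth(concept):
    """Quantifier nesting depth of a concept, as in Definition 3."""
    kind = concept[0]
    if kind in ("some", "all"):      # ("some", role, filler) / ("all", role, filler)
        return 1 + depth(concept[2])
    if kind == "and":                # ("and", c1, ..., cn)
        return max(depth(part) for part in concept[1:])
    return 0                         # atoms, negated atoms, top, bottom, number restrictions


R = frozenset({"R"})
A, B = ("atom", "A"), ("atom", "B")
Y = ("and", ("some", R, A), ("some", R, B), ("atmost", 1, R))
nested = ("some", R, ("all", R, A))
```

Here `depth(Y)` is 1 because number restrictions do not add depth, while `depth(nested)` is 2.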

A depth $n$ subexpression of a concept $C$ is a subexpression of $C$ that occurs at
depth $n$ of $C$. Furthermore, for a consistent concept $C$ in which a role $R$ occurs
as a substring, we say that $C$ has an $R$-successor if, in every model of $C$, there
is an individual which functions as an $R$-successor.

Definition 4. Given a consistent concept $C$ and a role $R$ occurring in $C$, we
say that $C$ has an $R$-successor iff, for all models $\mathcal{I}$ of $C$, there are individuals
$i, j$ such that $(i, j) \in R^{\mathcal{I}}$.

The LCS of two concepts C and D is defined as the set of most specific 
concepts which subsume both C and D. 

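Definition 5. Let $C$ and $D$ be concepts. Then we define the set of least common
subsumers as:

$$\mathit{lcs}(C, D) := \{E \mid C \sqsubseteq E \wedge D \sqsubseteq E \wedge \forall E' : C \sqsubseteq E' \wedge D \sqsubseteq E' \Rightarrow E \sqsubseteq E'\}.$$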

From this definition it follows immediately that, for concepts $C$ and $D$, all
pairs of elements of $\mathit{lcs}(C, D)$ are equivalent. Due to this uniqueness
property, we will consider $\mathit{lcs}(C, D)$ as a concept rather than a set of
concepts in the following. Definition 5 can straightforwardly be extended to $n$
arguments. In this case, $\mathit{lcs}$ is associative and commutative and
$\mathit{lcs}(C_1, \ldots, C_n) = \mathit{lcs}(C_1, \mathit{lcs}(C_2, \ldots \mathit{lcs}(C_{n-1}, C_n) \cdots))$.
In our analysis, it will be convenient to define the most specific role of two
role conjunctions.

Definition 6. Let $R = \sqcap_{1 \leq i \leq n} R_i$ and $S = \sqcap_{1 \leq j \leq m} S_j$ be roles,
$\mathfrak{R} = \{R_1, \ldots, R_n\}$, and $\mathfrak{S} = \{S_1, \ldots, S_m\}$. Then we define the
most specific role of $R$ and $S$ as a partial function:

$$\mathit{msr}(R, S) := \begin{cases} \sqcap_{T \in \mathfrak{R} \cap \mathfrak{S}}\, T & \text{if } \mathfrak{R} \cap \mathfrak{S} \neq \emptyset, \\ \text{undefined} & \text{otherwise.} \end{cases}$$

Definition 6 can easily be extended to more than two role arguments. In the next
section, we will define a normal form for concepts with which the subsumption
problem for $\mathcal{ALENR}$ can be decided by a structural subsumption algorithm.
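
As a small illustration of Definition 6, the sketch below represents a role conjunction by the set of its atomic roles. It assumes that the most specific role is the conjunction of the atomic roles shared by all arguments, which is the reading suggested by the at-leasts example in Section 4; the function name `msr` and the use of `None` for "undefined" are choices made for this sketch.

```python
from functools import reduce


def msr(*roles):
    """Most specific role of one or more role conjunctions.

    Each role is a frozenset of atomic role names. Returns the conjunction
    of the shared atomic roles, or None if no atomic role is shared
    (the 'undefined' case).
    """
    common = reduce(frozenset.intersection, roles)
    return common if common else None


RS = frozenset({"R", "S"})   # R ⊓ S
RT = frozenset({"R", "T"})   # R ⊓ T
U = frozenset({"U"})
```

With these hypothetical roles, `msr(RS, RT)` is the role `R`, whereas `msr(RS, U)` is undefined.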

3 Structural Subsumption Normal Forms 



We are interested in a normal form with which the subsumption relation between 
pairs of concepts can be decided by structural comparisons. It will be convenient 
to order the concept components w.r.t. the concept forming operators. 

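Definition 7. A concept $C$ is in sorted normal form (SNF) iff $C = \top$,
$C = \bot$, or

$$C = A \sqcap E \sqcap F \sqcap L \sqcap M \qquad (1)$$

with

$E = \sqcap_{1 \leq i \leq n} \exists R_i.C_i$,

$F = \sqcap_{1 \leq i \leq m} \forall R'_i.C'_i$,

$L = \sqcap_{1 \leq i \leq p} (\geq n_i\, S_i)$, and

$M = \sqcap_{1 \leq i \leq q} (\leq m_i\, S'_i)$

where $A$ is an arbitrary conjunction of atomic and negated atomic concepts
and $C_i$ and $C'_i$ are also in SNF. Furthermore, we assume all nested conjunctions
to be flattened, i.e., $A \sqcap (B \sqcap C) \rightarrow A \sqcap B \sqcap C$, conjunctions of
$\forall$-quantifications to be factorized, i.e.,
$\forall R.C \sqcap \forall R.D \rightarrow \forall R.(C \sqcap D)$, $\forall$-quantifications to be spread
over other $\forall$-quantifications, i.e.,
$\forall R.C \sqcap \forall S.D \rightarrow \forall R.C \sqcap \forall S.(C \sqcap D)$ if $S$ is a subrole of $R$,
and $\forall$-quantifications to be spread over $\exists$-quantifications, i.e.,
$\exists R.C \sqcap \forall S.D \rightarrow \exists R.(C \sqcap D) \sqcap \forall S.D$ if $R$ is a subrole of $S$.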

We will now give an algorithm with which subsumption between concepts can
be computed by structural comparisons.

Algorithm 1 structural-subsumption($C$, $D$)

Let $C = C_1 \sqcap \cdots \sqcap C_n$ and $D = D_1 \sqcap \cdots \sqcap D_m$. Then $D \sqsubseteq C$ iff
$C = \top$ or $D = \bot$ or, for all $C_i$:

1. if $C_i$ is an atomic concept, then there exists a $D_j$ such that $C_i = D_j$,

2. if $C_i$ is of the form $\exists R.C'$ ($\forall R.C'$), then there exists a $D_j$ of the form
$\exists S.D'$ ($\forall S.D'$) such that $S$ ($R$) is a subrole of $R$ ($S$) and $D' \sqsubseteq C'$,

3. if $C_i$ is of the form $(\geq n\, R)$ [$(\leq n\, R)$], then there exists a $D_j$ of the form
$(\geq m\, S)$ [$(\leq m\, S)$] such that $n \leq m$ ($m \leq n$) and $S$ ($R$) is a subrole of $R$ ($S$).

In Algorithm 1, the test if $C = \top$ ($C = \bot$) can be performed by the subsumption
(satisfiability) algorithm for the DL $\mathcal{ALCNR}$ [5]. Note that the algorithm does
not necessarily return the correct result for arbitrary $\mathcal{ALENR}$ concepts. In
the sequel, we will define a normal form for concepts $C$ and $D$ such that
structural-subsumption($C$, $D$) returns true iff $D \sqsubseteq C$. We will motivate the
definition by an example.

When determining the subsumption relationship between concepts, several
kinds of possible “interactions” between concept forming operators must be
considered. These interactions are pointed out by the following example. Let
$X_1 := \exists R.(A \sqcap B)$, $X_2 := \forall R.(A \sqcap B)$, $X_3 := (\geq 1\, R)$, and
$Y := \exists R.A \sqcap \exists R.B \sqcap (\leq 1\, R)$. Then, the invocations
structural-subsumption($X_1$, $Y$), structural-subsumption($X_2$, $Y$), and
structural-subsumption($X_3$, $Y$) all return false, even though $Y \sqsubseteq X_1$,
$Y \sqsubseteq X_2$, and $Y \sqsubseteq X_3$ hold. The reason is that $Y$ implies the
$\exists$-quantification $\exists R.(A \sqcap B)$, the $\forall$-quantification $\forall R.(A \sqcap B)$, and the
$\geq$-restriction $(\geq 1\, R)$. These concepts are not explicitly present as
subexpressions of $Y$, but their existence would be relevant for the structural
subsumption algorithm to work correctly.

As a consequence, in order to transform a concept $C$ into a normal form $\hat{C}$
such that subsumption can be decided by structural comparisons, the idea is
to make the relevant implicitly contained information explicit and conjunctively
add it to $C$ in the form of additional concept components. Since $\hat{C}$ contains
no “new” information w.r.t. $C$, semantic equivalence between $C$ and $\hat{C}$ is
guaranteed. For instance, if we define
$\hat{Y} := Y \sqcap \exists R.(A \sqcap B) \sqcap \forall R.(A \sqcap B) \sqcap (\geq 1\, R)$,
$\hat{X}_1 := X_1 \sqcap (\geq 1\, R)$, $\hat{X}_2 := X_2$, and $\hat{X}_3 := X_3$, the invocations
structural-subsumption($\hat{X}_1$, $\hat{Y}$), structural-subsumption($\hat{X}_2$, $\hat{Y}$), and
structural-subsumption($\hat{X}_3$, $\hat{Y}$) all return true as desired. Obviously, atomic
concept components do not influence the process of computing the structural
subsumption normal form of a concept, and there are no relevant $\leq$-restrictions
that need to be made explicit.
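
The behaviour described in this example can be reproduced with a short Python sketch of Algorithm 1. The tuple encoding of concepts, the representation of role conjunctions as sets of atomic role names, and the purely syntactic tests for $\top$ and $\bot$ are simplifying assumptions of this sketch; the paper delegates those two tests to a full reasoner.

```python
def conjuncts(c):
    """Top-level conjuncts of a concept."""
    return list(c[1:]) if c[0] == "and" else [c]


def subrole(r, s):
    """r is a subrole of s: every atomic role of s also occurs in r."""
    return s <= r


def structural_subsumption(c, d):
    """True iff Algorithm 1 concludes that d is subsumed by c."""
    if c == ("top",) or d == ("bot",):
        return True
    ds = conjuncts(d)
    for ci in conjuncts(c):
        kind = ci[0]
        if kind == "top":
            ok = True
        elif kind == "some":      # ("some", R, C'): need some S.D' with S below R
            ok = any(dj[0] == "some" and subrole(dj[1], ci[1])
                     and structural_subsumption(ci[2], dj[2]) for dj in ds)
        elif kind == "all":       # ("all", R, C'): need all S.D' with R below S
            ok = any(dj[0] == "all" and subrole(ci[1], dj[1])
                     and structural_subsumption(ci[2], dj[2]) for dj in ds)
        elif kind == "atleast":   # ("atleast", n, R): need (>= m S), n <= m, S below R
            ok = any(dj[0] == "atleast" and ci[1] <= dj[1]
                     and subrole(dj[2], ci[2]) for dj in ds)
        elif kind == "atmost":    # ("atmost", n, R): need (<= m S), m <= n, R below S
            ok = any(dj[0] == "atmost" and dj[1] <= ci[1]
                     and subrole(ci[2], dj[2]) for dj in ds)
        else:                     # atoms and negated atoms must occur literally
            ok = ci in ds
        if not ok:
            return False
    return True


R = frozenset({"R"})
A, B = ("atom", "A"), ("atom", "B")
AB = ("and", A, B)
X1, X2, X3 = ("some", R, AB), ("all", R, AB), ("atleast", 1, R)
Y = ("and", ("some", R, A), ("some", R, B), ("atmost", 1, R))
Y_hat = ("and", *conjuncts(Y), X1, X2, X3)   # implicit conjuncts made explicit
```

On the raw concept `Y` all three structural tests fail, while on `Y_hat`, where the implied conjuncts have been added, all three succeed.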

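We will now generalize the ideas pointed out by the above example in order
to get a definition of a structural subsumption normal form of a concept. For a
set of concepts $\mathcal{S}$, we define $\mathcal{S}^{\not\equiv}$ as the largest subset of $\mathcal{S}$ in which
no pair of equivalent concepts occurs.

Definition 8. Let $C$ be a concept in SNF given by (1). Then, we first define

$\mathcal{E}_C := \{\exists R.D \mid C \sqcap \exists R.D \equiv C \wedge \forall (\exists R.D') : C \sqcap \exists R.D' \equiv C \Rightarrow D \sqsubseteq D'\}$,

$\mathcal{L}_C := \{(\geq n\, R) \mid C \sqcap (\geq n\, R) \equiv C \wedge C \sqcap (\leq n - 1\, R) \text{ is inconsistent}\}$, and

$\mathcal{A}_C := \{\forall R.D \mid C \sqcap \forall R.D \equiv C \wedge \forall D' : C \sqcap \forall R.D' \equiv C \Rightarrow D \sqsubseteq D'\}$.

A concept $\hat{C}$ is in structural subsumption normal form (SSNF) w.r.t. $C$ iff

$$\hat{C} = A \sqcap \hat{E} \sqcap \hat{F} \sqcap L \sqcap M \sqcap \sqcap_{\exists R.D \in \mathcal{E}_C^{\not\equiv}} \exists R.\hat{D} \sqcap \sqcap_{L' \in \mathcal{L}_C^{\not\equiv}} L' \sqcap \sqcap_{\forall R.D \in \mathcal{A}_C^{\not\equiv}} \forall R.\hat{D}$$

where $\hat{E}$, $\hat{F}$, and $\hat{D}$ are in SSNF w.r.t. $E$, $F$, and $D$, respectively.
$\mathit{SSNF}(C)$ denotes the set of all concepts in SSNF w.r.t. $C$.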

Intuitively, for a concept $C$, the sets $\mathcal{E}_C$, $\mathcal{L}_C$, and $\mathcal{A}_C$ contain the most
specific $\exists$-quantifications, $\geq$-restrictions, and $\forall$-quantifications following
from $C$ which are relevant for structural subsumption computation but possibly
not explicitly represented as conjuncts in $C$. These sets are used for the
definition of a new concept $\hat{C}$ in which the additional $\exists$-quantifications,
$\geq$-restrictions, and $\forall$-quantifications are made explicit. By construction it
is clear that $\mathcal{E}_C^{\not\equiv}$, $\mathcal{L}_C^{\not\equiv}$, and $\mathcal{A}_C^{\not\equiv}$ are finite and, thus, $\hat{C}$ is
well defined. In general, an SSNF of a concept $C$ need not be unique. Therefore,
we collect all concepts in SSNF w.r.t. $C$ in the set $\mathit{SSNF}(C)$. Later we will
sketch an algorithm for computing one element of this set. As an obvious
consequence of Definition 8, we can state the following proposition.

Proposition 1. Let $C$ be a concept and $\hat{C} \in \mathit{SSNF}(C)$. Then $\hat{C} \equiv C$.

Proof (sketch). The proposition holds because $C$ is a subexpression of $\hat{C}$ and
the additional constraints in $\hat{C}$ are logical consequences of $C$.

As a consequence of Proposition 1 and Definition 8, we get the following theorem.

Theorem 1. Let $C$ and $D$ be concepts and $\hat{C} \in \mathit{SSNF}(C)$ and $\hat{D} \in \mathit{SSNF}(D)$.
Then structural-subsumption($\hat{C}$, $\hat{D}$) returns true iff $D \sqsubseteq C$.

Proof (sketch). Theorem 1 holds since, by Proposition 1, we have $\hat{C} \equiv C$ and
$\hat{D} \equiv D$ and, in the concepts $\hat{C}$ and $\hat{D}$, all $\exists$-, $\forall$-quantifications, and
$\geq$-restrictions relevant for structural subsumption computation are made
explicit.

When determining the subsumption relationship between two concepts $C$
and $D$, we first compute concepts $\hat{C} \in \mathit{SSNF}(C)$ and $\hat{D} \in \mathit{SSNF}(D)$ and invoke
Algorithm 1 on $\hat{C}$ and $\hat{D}$. Theorem 1 guarantees that the subsumption
relationship $D \sqsubseteq C$ holds if and only if structural-subsumption($\hat{C}$, $\hat{D}$) returns
true. In the next section, we will develop an algorithm for computing
$\hat{C} \in \mathit{SSNF}(C)$ for a concept $C$.

4 Computing Structural Subsumption Normal Forms 

The preceding section showed that in order to compute a concept $\hat{C}$ in SSNF
w.r.t. a concept $C$, we must make information explicit which is only implicitly
contained in $C$. The difficulty is to capture exactly those $\exists$-,
$\forall$-quantifications, and $\geq$-restrictions which are induced by every model of
$C$. Therefore, the algorithm is closely related to the functioning of a tableaux
prover for $\mathcal{ALENR}$ [5]. Whereas tableaux provers aim at creating a constraint
system representing only one possible model of $C$, our algorithm will create a
finite set of (partial) constraint systems representing the set of all (partial)
models of $C$. Each constraint system of this set induces a concept and, by
considering the commonalities of these concepts, the sets $\mathcal{E}_C^{\not\equiv}$ and
$\mathcal{A}_C^{\not\equiv}$ in Definition 8 can be obtained. In order to keep the presentation
of the algorithm simple, we will only consider consistent concepts $C$ with
$C \not\equiv \top$. This can be done without loss of generality since subsumption
between concepts $C$ and $D$ can easily be determined by Algorithm 1 if $C$ or $D$
(or both) are equivalent to $\top$ or $\bot$.

We first introduce some helpful notation. Throughout the rest of this section,
for a concept $C$ in SNF given by (1), let

$\mathfrak{E}_C := \{\exists R_1.C_1, \ldots, \exists R_n.C_n\}$, $\mathfrak{A}_C := \{\forall R'_1.C'_1, \ldots, \forall R'_m.C'_m\}$,

$\mathfrak{L}_C := \{(\geq n_1\, S_1), \ldots, (\geq n_p\, S_p)\}$, $\mathfrak{M}_C := \{(\leq m_1\, S'_1), \ldots, (\leq m_q\, S'_q)\}$.

We assume that there exists an alphabet of variable symbols. A constraint is a
syntactic object of the form $(x, R, C)$ where $x$ is a variable, $R$ is a role, and $C$
is a concept. Intuitively, $(x, R, C)$ says that $x$ has to be interpreted as an
$R$-successor in the interpretation of $C$. A constraint system $\mathit{CS}$ is a finite,
non-empty set of constraints.

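Definition 9. Let $C$ be a concept in SNF given by (1) and
$\mathfrak{A}_C|_R := \{\forall S.D \in \mathfrak{A}_C \mid R \text{ is a subrole of } S\}$. Then we define the function

$\mathit{cs}(C) := \{(x, R, D) \mid \text{there exists } \exists R.D \text{ in } \mathfrak{E}_C\} \cup$
$\{(x, R, D)^1, \ldots, (x, R, D)^r \mid \text{there exists } (\geq r\, R) \text{ in } \mathfrak{L}_C \text{ and } D = D_1 \sqcap \cdots \sqcap D_k$
$\text{if } \mathfrak{A}_C|_R = \{\forall S_1.D_1, \ldots, \forall S_k.D_k\},\ D = \top \text{ if } \mathfrak{A}_C|_R = \emptyset\}$.

Intuitively, the function $\mathit{cs}(C)$ returns the constraint system induced by $\mathfrak{E}_C$
and $\mathfrak{L}_C$. Our intention is to successively modify $\mathit{cs}(C)$ into a new constraint
system $\mathit{CS}'$ such that the sets $\mathcal{E}_C^{\not\equiv}$, $\mathcal{L}_C^{\not\equiv}$, and $\mathcal{A}_C^{\not\equiv}$ can easily be
derived from $\mathit{CS}'$. As the example in Section 3 shows, it may be necessary to
“merge” constraints due to the existence of $\leq$-restrictions in $C$. We will model
the merging process as a result of repetitive applications of a merging rule to
$\mathit{cs}(C)$ yielding a new constraint system $\mathit{CS}'$.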

For a concept $C$ and a constraint system $\mathit{CS}$, we say that $\mathit{CS}$ is compliant
with $\mathfrak{L}_C$ iff, for all $(\geq n\, R) \in \mathfrak{L}_C$, there exist at least $n$ constraints in
$\mathit{CS}$ involving subroles of $R$. $\mathit{CS}$ is compliant with $\mathfrak{A}_C$ iff, for all
$\forall R.D \in \mathfrak{A}_C$, there exists no constraint $(x, S, D')$ in $\mathit{CS}$ such that $S$ is a
subrole of $R$ and $D \sqcap D'$ is inconsistent. Finally, $\mathit{CS}$ is compliant with
$\mathfrak{M}_C$ iff, for all $(\leq n\, R) \in \mathfrak{M}_C$, there exist at most $n$ constraints in $\mathit{CS}$
involving subroles of $R$.

Definition 10. Let $\mathit{CS} = \{(x, R_1, C_1), \ldots, (x, R_n, C_n)\}$ be a constraint
system and $\mathfrak{L}_C$ and $\mathfrak{M}_C$ be given for a concept $C$. Then we say that $\mathit{CS}'$
emerges from $\mathit{CS}$ by application of the $\exists$-merging rule
($\mathit{CS} \rightarrow_{\exists} \mathit{CS}'$) iff $\exists k, k' \in \{1, \ldots, n\}$, $k \neq k'$, such that

(i) $\mathit{CS}' = \{(x, R_i, C_i) \mid i \in \{1, \ldots, n\} \setminus \{k, k'\}\} \cup \{(x, R_k \sqcap R_{k'}, C_k \sqcap C_{k'})\}$,

(ii) $\mathit{CS}'$ is compliant with $\mathfrak{L}_C$ and $\mathfrak{A}_C$ but not with $\mathfrak{M}_C$,

(iii) $\mathit{msr}(R_k, R_{k'})$ is defined, and

(iv) $C_k \sqcap C_{k'}$ is consistent.

$\mathit{CS}'$ emerges from $\mathit{CS}$ by successive applications of the $\exists$-merging rule iff
$\exists \mathit{CS}_1, \ldots, \mathit{CS}_r$, $r \in \mathbb{N} \cup \{0\}$, such that
$\mathit{CS} = \mathit{CS}_1 \rightarrow_{\exists} \cdots \rightarrow_{\exists} \mathit{CS}_r = \mathit{CS}'$. $\mathit{CS}'$ is $\exists$-merging rule
complete w.r.t. $\mathit{CS}$ iff $\mathit{CS}'$ emerges from $\mathit{CS}$ by successive applications of
the $\exists$-merging rule and $\neg \exists \mathit{CS}'' : \mathit{CS}' \rightarrow_{\exists} \mathit{CS}''$. In this case, $\mathit{CS}'$ is
called an $\exists$-merging rule completion of $\mathit{CS}$. Furthermore, we define the set of
$\exists$-merging rule completions of $\mathit{CS}$ as

$M^{\exists}_{\mathit{CS}} := \{\mathit{CS}' \mid \mathit{CS}' \text{ is an } \exists\text{-merging rule completion of } \mathit{CS}\}$.

Intuitively, by applying the $\exists$-merging rule to a constraint system, we obtain
a new constraint in which the role components and the concept components of
two constraints are conjunctively combined, respectively. According to
Definition 10, we can apply the $\exists$-merging rule if the resulting constraint
system is compliant with $\mathfrak{L}_C$ and $\mathfrak{A}_C$ but not with $\mathfrak{M}_C$. For the new
constraint to be able to represent the two merged constraints, the msr of the
roles of the two merged constraints must be defined and the conjunction of
their concept components must be consistent.

Let us apply the $\exists$-merging rule to our example concept $Y$. We have
$\mathit{cs}(Y) = \{(x, R, A), (x, R, B)\}$ and, by two alternative applications of
$\rightarrow_{\exists}$, we get $M^{\exists}_{\mathit{cs}(Y)} = \{\{(x, R, A \sqcap B)\}, \{(x, R, B \sqcap A)\}\}$.
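
The merging process can be illustrated by the following Python sketch, which is deliberately restricted: it only covers concepts whose depth 0 conjuncts are $\exists$-quantifications over conjunctions of literals and $\leq$-restrictions, so that compliance with the $\geq$-restrictions and $\forall$-quantifications holds vacuously. It also assumes that merging is attempted exactly as long as the current system still violates some $\leq$-restriction, which is how the examples in this section behave. All names (`exists_completions`, `merge_steps`, and so on) and the encoding of a constraint as a pair of sets are assumptions of this sketch.

```python
from itertools import combinations


def key(constraint):
    """Sort key so that a constraint system has one canonical tuple form."""
    role, lits = constraint
    return (sorted(role), sorted(lits))


def consistent(lits):
    """A conjunction of literals is consistent iff it has no pair A, -A."""
    return not any(("-" + lit) in lits for lit in lits)


def compliant_atmost(cs, atmosts):
    """For every (<= n R): at most n constraints involve a subrole of R."""
    return all(sum(1 for role, _ in cs if r <= role) <= n for n, r in atmosts)


def merge_steps(cs):
    """All systems obtained by merging one pair of constraints."""
    for i, j in combinations(range(len(cs)), 2):
        (r1, c1), (r2, c2) = cs[i], cs[j]
        if not (r1 & r2) or not consistent(c1 | c2):  # msr undefined or clash
            continue
        rest = [c for k, c in enumerate(cs) if k not in (i, j)]
        yield tuple(sorted(rest + [(r1 | r2, c1 | c2)], key=key))


def exists_completions(cs, atmosts):
    """Set of completions reachable from cs under the assumptions above."""
    cs = tuple(sorted(cs, key=key))
    steps = [] if compliant_atmost(cs, atmosts) else list(merge_steps(cs))
    if not steps:
        return {cs}
    return set().union(*(exists_completions(nxt, atmosts) for nxt in steps))


R = frozenset({"R"})
cs_y = [(R, frozenset({"A"})), (R, frozenset({"B"}))]          # cs(Y)
cs_x = [(R, frozenset({a})) for a in ("A1", "A2", "A3")]        # three R-successors
completions_y = exists_completions(cs_y, [(1, R)])              # with (<= 1 R)
completions_x = exists_completions(cs_x, [(2, R)])              # with (<= 2 R)
```

Because conjunctions are stored as sets, the two alternative completions of `cs_y` (merging $A$ into $B$ or $B$ into $A$) coincide in this encoding, whereas three constraints under a $(\leq 2\, R)$ restriction yield three different completions with two constraints each.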

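In the next step, we will show how to compute $\mathcal{E}_C^{\not\equiv}$ and $\mathcal{A}_C^{\not\equiv}$ (see
Definition 8) from $M^{\exists}_{\mathit{cs}(C)}$ for a concept $C$. We introduce two functions
$\mathit{somes}(\mathit{CS})$ and $\mathit{alls}(\mathit{CS}, \mathfrak{M}_C)$ which extract the $\exists$- and
$\forall$-quantifications induced by the constraint system $\mathit{CS}$.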

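Definition 11. Let $\mathit{CS}$ be a constraint system and
$\mathit{CS}|_R := \{(x, S, C) \in \mathit{CS} \mid S \text{ is a subrole of } R\}$ and

$E_{\mathit{CS}} := \{\exists R.D \mid (x, R, D) \in \mathit{CS}\}$ and

$A_{\mathit{CS}, \mathfrak{M}_C} := \{\forall R.D \mid \text{there exists } (\leq n\, S) \text{ in } \mathfrak{M}_C \text{ such that } \sharp \mathit{CS}|_S = n \text{ with}$
$\mathit{CS}|_S = \{(x, R_1, C_1), \ldots, (x, R_n, C_n)\} \text{ and } D = \mathit{lcs}(C_1, \ldots, C_n)$
$\text{and } R = \mathit{msr}(R_1, \ldots, R_n)\}$.

Then we define the following functions:

$$\mathit{somes}(\mathit{CS}) := \sqcap_{E \in E_{\mathit{CS}}} E, \qquad \mathit{alls}(\mathit{CS}, \mathfrak{M}_C) := \sqcap_{A \in A_{\mathit{CS}, \mathfrak{M}_C}} A \qquad (2)$$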

For each constraint $(x, R, C)$ in $\mathit{CS}$, $\mathit{somes}(\mathit{CS})$ contains a conjunct
$\exists R.C$. The function $\mathit{alls}(\mathit{CS}, \mathfrak{M}_C)$ contains a $\forall$-quantification of the
form $\forall\, \mathit{msr}(R_1, \ldots, R_n).\mathit{lcs}(C_1, \ldots, C_n)$ if there is a $\leq$-restriction in
$\mathfrak{M}_C$ involving a number $n$ which is equal to the number of constraints
$(x, R_1, C_1), \ldots, (x, R_n, C_n)$ in $\mathit{CS}$ involving a subrole of $R$. The set
$A_{\mathit{CS}, \mathfrak{M}_C}$ is well defined since the LCS operation used in (2) is applied to
concepts of depth smaller than the original concept $C$. Hence, no cyclic
definition is present. Considering $\mathfrak{M}_Y = \{(\leq 1\, R)\}$, we get
$\mathit{somes}(\{(x, R, A \sqcap B)\}) = \exists R.(A \sqcap B)$,
$\mathit{somes}(\{(x, R, B \sqcap A)\}) = \exists R.(B \sqcap A)$,
$\mathit{alls}(\{(x, R, A \sqcap B)\}, \mathfrak{M}_Y) = \forall R.(A \sqcap B)$, and
$\mathit{alls}(\{(x, R, B \sqcap A)\}, \mathfrak{M}_Y) = \forall R.(B \sqcap A)$.

Each single conjunction $\mathit{SA}_i := \mathit{somes}(\mathit{CS}_i) \sqcap \mathit{alls}(\mathit{CS}_i, \mathfrak{M}_C)$ characterizes
properties of a specific class of partial¹ models of the depth 0 components of $C$.
Since we are interested in a characterization of the properties of all partial
models of the depth 0 components of $C$, we compute the commonalities of
$\mathit{SA}_1, \ldots, \mathit{SA}_n$ by means of the LCS operation defined in Section 2.

¹ We use the expression “partial models” because atomic and negated atomic concept
components are not considered in the constraint systems.

Definition 12. Let $C$ be a concept, $M^{\exists}_{\mathit{cs}(C)} = \{\mathit{CS}_1, \ldots, \mathit{CS}_n\}$ be the set of
$\exists$-merging rule completions of $\mathit{cs}(C)$. Then we define

$$\exists\forall\text{-completion}(C) := \mathit{lcs}(\mathit{somes}(\mathit{CS}_1) \sqcap \mathit{alls}(\mathit{CS}_1, \mathfrak{M}_C), \ldots, \mathit{somes}(\mathit{CS}_n) \sqcap \mathit{alls}(\mathit{CS}_n, \mathfrak{M}_C)).$$

Since we compute the LCS of conjunctions of $\exists$- and $\forall$-quantifications which
do not interact with each other, the computation can be restricted to an LCS
operation on concepts with depth smaller than the original concept $C$. Hence, no
cyclic definition is present and $\exists\forall$-completion($C$) is well defined. Let us
compute $\exists\forall$-completion($Y$) for our example concept $Y$. We have
$\mathit{lcs}(\exists R.(A \sqcap B) \sqcap \forall R.(A \sqcap B),\ \exists R.(B \sqcap A) \sqcap \forall R.(B \sqcap A)) = \exists R.(A \sqcap B) \sqcap \forall R.(A \sqcap B)$.
We can now state the following proposition.

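Proposition 2. Let $C$ be a concept. Then
$\exists\forall\text{-completion}(C) \equiv \sqcap_{E \in \mathcal{E}_C^{\not\equiv}} E \sqcap \sqcap_{A \in \mathcal{A}_C^{\not\equiv}} A$.

Proof (sketch). Each $\exists$-merging rule completion $\mathit{CS}$ of $\mathit{cs}(C)$ induces a
partition of the set of all partial models of the depth 0 components of $C$. For
each $\mathit{CS}$, $\mathit{somes}(\mathit{CS})$ and $\mathit{alls}(\mathit{CS}, \mathfrak{M}_C)$ represent the $\exists$- and
$\forall$-quantifications induced by $\mathit{CS}$ and $\mathfrak{M}_C$. Hence, $\exists\forall$-completion($C$)
represents the conjunction of $\exists$- and $\forall$-quantifications induced by all
$\exists$-merging rule completions of $\mathit{cs}(C)$, which is equivalent to
$\sqcap_{E \in \mathcal{E}_C^{\not\equiv}} E \sqcap \sqcap_{A \in \mathcal{A}_C^{\not\equiv}} A$.

From Proposition 2 it follows that, for a concept $C$, we can compute the sets
$\mathcal{E}_C^{\not\equiv}$ and $\mathcal{A}_C^{\not\equiv}$ in Definition 8 as follows. We first generate the constraint
system $\mathit{cs}(C)$. By successive $\exists$-merging rule applications we construct the set
of $\exists$-merging rule completions $M^{\exists}_{\mathit{cs}(C)} = \{\mathit{CS}_1, \ldots, \mathit{CS}_n\}$ from $\mathit{cs}(C)$.
Then, for each $\mathit{CS}_i$, we construct $\mathit{somes}(\mathit{CS}_i) \sqcap \mathit{alls}(\mathit{CS}_i, \mathfrak{M}_C)$. Eventually,
by means of the LCS operation, we can compute the $\exists$- and $\forall$-quantifications
in $\mathcal{E}_C^{\not\equiv}$ and $\mathcal{A}_C^{\not\equiv}$.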

In order to compute $\mathcal{L}_C^{\not\equiv}$ in Definition 8, we make use of a merging rule
similar to the one in Definition 10. Starting from a concept $C$, our intention is
to obtain a set of constraint systems $M^{\geq}_{\mathit{cs}(C)}$ from which the $\geq$-restrictions
in $\mathcal{L}_C^{\not\equiv}$ can be derived in a similar way as $\mathcal{E}_C^{\not\equiv}$ and $\mathcal{A}_C^{\not\equiv}$ can be
obtained from $M^{\exists}_{\mathit{cs}(C)}$. Definition 10 cannot be used for this since the
merging process terminates as soon as the constraint system is compliant with
$\mathfrak{M}_C$. As explained by the following example, it is necessary to continue the
merging process as long as the constraint system is compliant with $\mathfrak{L}_C$ and
$\mathfrak{A}_C$. Let $X := \exists R.A_1 \sqcap \exists R.A_2 \sqcap \exists R.A_3 \sqcap (\leq 2\, R)$. Then,
$\mathit{cs}(X) = \{(x, R, A_1), (x, R, A_2), (x, R, A_3)\}$ and
$M^{\exists}_{\mathit{cs}(X)} = \{\{(x, R, A_1 \sqcap A_2), (x, R, A_3)\}, \{(x, R, A_1 \sqcap A_3), (x, R, A_2)\}, \{(x, R, A_2 \sqcap A_3), (x, R, A_1)\}\}$.
From $M^{\exists}_{\mathit{cs}(X)}$ it follows that $X$ has at least two $R$-successors. However, as
can be easily seen, there are models of $X$ with only one $R$-successor. Therefore,
we change the condition for stopping the merging process such that the
termination criterion is independent of $\mathfrak{M}_C$.

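Definition 13. Let $\mathit{CS} = \{(x, R_1, C_1), \ldots, (x, R_n, C_n)\}$ be a constraint system
and $\mathfrak{L}_C$ be given for a concept $C$. Then we say that $\mathit{CS}'$ emerges from $\mathit{CS}$ by
application of the $\geq$-merging rule ($\mathit{CS} \rightarrow_{\geq} \mathit{CS}'$) iff
$\exists k, k' \in \{1, \ldots, n\}$, $k \neq k'$, such that

(i) $\mathit{CS}' = \{(x, R_i, C_i) \mid i \in \{1, \ldots, n\} \setminus \{k, k'\}\} \cup \{(x, R_k \sqcap R_{k'}, C_k \sqcap C_{k'})\}$,

(ii) $\mathit{CS}'$ is compliant with $\mathfrak{L}_C$ and with $\mathfrak{A}_C$,

(iii) $\mathit{msr}(R_k, R_{k'})$ is defined, and

(iv) $C_k \sqcap C_{k'}$ is consistent.

$\mathit{CS}'$ emerges from $\mathit{CS}$ by successive applications of the $\geq$-merging rule iff
$\exists \mathit{CS}_1, \ldots, \mathit{CS}_r$, $r \in \mathbb{N} \cup \{0\}$, such that
$\mathit{CS} = \mathit{CS}_1 \rightarrow_{\geq} \cdots \rightarrow_{\geq} \mathit{CS}_r = \mathit{CS}'$. $\mathit{CS}'$ is $\geq$-merging rule complete
w.r.t. $\mathit{CS}$ iff $\mathit{CS}'$ emerges from $\mathit{CS}$ by successive applications of the
$\geq$-merging rule and $\neg \exists \mathit{CS}'' : \mathit{CS}' \rightarrow_{\geq} \mathit{CS}''$. In this case, $\mathit{CS}'$ is called
a $\geq$-merging rule completion of $\mathit{CS}$. Furthermore, we define the set of
$\geq$-merging rule completions of $\mathit{CS}$ as

$M^{\geq}_{\mathit{CS}} := \{\mathit{CS}' \mid \mathit{CS}' \text{ is a } \geq\text{-merging rule completion of } \mathit{CS}\}$.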

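As an example, after repetitive $\geq$-merging rule applications on $\mathit{cs}(X)$, we get
$M^{\geq}_{\mathit{cs}(X)} = \{\{(x, R, A_1 \sqcap A_2 \sqcap A_3)\}, \{(x, R, A_1 \sqcap A_3 \sqcap A_2)\}, \{(x, R, A_2 \sqcap A_1 \sqcap A_3)\},$
$\{(x, R, A_2 \sqcap A_3 \sqcap A_1)\}, \{(x, R, A_3 \sqcap A_1 \sqcap A_2)\}, \{(x, R, A_3 \sqcap A_2 \sqcap A_1)\}\}$.

The final step for computing $\mathcal{L}_C^{\not\equiv}$ is similar to the one for computing
$\mathcal{E}_C^{\not\equiv}$ and $\mathcal{A}_C^{\not\equiv}$.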

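Definition 14. Let $\mathit{CS}$ be a constraint system, $\mathfrak{R} := \{R \mid (x, R, C) \in \mathit{CS}\}$,
$\mathit{CS}|_R := \{(x, S, C) \in \mathit{CS} \mid S \text{ is a subrole of } R\}$, and

$L_{\mathit{CS}} := \{(\geq n\, R) \mid \exists \mathfrak{R}' \in 2^{\mathfrak{R}} \text{ with } \mathfrak{R}' = \{R_1, \ldots, R_t\} \text{ and } R = \mathit{msr}(R_1, \ldots, R_t)$
$\text{is defined and } n = \sharp \mathit{CS}|_R\}$.

Then we define the function: $\mathit{at\text{-}leasts}(\mathit{CS}) := \sqcap_{L \in L_{\mathit{CS}}} L$.

The function $\mathit{at\text{-}leasts}(\mathit{CS})$ computes the $\geq$-restrictions induced by the
constraint system $\mathit{CS}$. For example, if
$\mathit{CS} = \{(x, R \sqcap S, A), (x, R \sqcap T, B)\}$, we get
$\mathit{at\text{-}leasts}(\mathit{CS}) = (\geq 1\, (R \sqcap S)) \sqcap (\geq 1\, (R \sqcap T)) \sqcap (\geq 2\, R)$.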
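
The example can be checked with a small Python sketch of Definition 14. It assumes that a role conjunction is given as the set of its atomic roles, that the most specific role of several roles is the conjunction of their shared atomic roles, and that $n$ counts the constraints whose role is a subrole of that most specific role; these readings reproduce the example but are assumptions of the sketch, as are the names `at_leasts` and `msr`.

```python
from itertools import combinations


def msr(roles):
    """Shared atomic roles of the given role conjunctions, or None if there are none."""
    common = frozenset.intersection(*roles)
    return common if common else None


def at_leasts(cs):
    """Pairs (n, role), each standing for a restriction (>= n role).

    `cs` is a list of (role, concept) pairs; the variable x is left implicit.
    """
    roles = {role for role, _ in cs}
    result = set()
    for size in range(1, len(roles) + 1):
        for subset in combinations(roles, size):
            r = msr(subset)
            if r is None:
                continue
            # Number of constraints whose role is a subrole of r.
            n = sum(1 for role, _ in cs if r <= role)
            result.add((n, r))
    return result


RS = frozenset({"R", "S"})   # R ⊓ S
RT = frozenset({"R", "T"})   # R ⊓ T
cs = [(RS, "A"), (RT, "B")]
restrictions = at_leasts(cs)
```

For this constraint system the sketch yields the three restrictions $(\geq 1\, (R \sqcap S))$, $(\geq 1\, (R \sqcap T))$, and $(\geq 2\, R)$.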

Definition 15. Let $C$ be a concept, $M^{\geq}_{\mathit{cs}(C)} = \{\mathit{CS}_1, \ldots, \mathit{CS}_n\}$ be the set of
$\geq$-merging rule completions of $\mathit{cs}(C)$. Then we define

$$\geq\text{-completion}(C) := \mathit{lcs}(\mathit{at\text{-}leasts}(\mathit{CS}_1), \ldots, \mathit{at\text{-}leasts}(\mathit{CS}_n)).$$

With this definition we can state the following proposition.

Proposition 3. Let $C$ be a concept. Then,
$\geq\text{-completion}(C) \equiv \sqcap_{L \in \mathcal{L}_C^{\not\equiv}} L$.

The proof of Proposition 3 is analogous to the proof of Proposition 2. We can
now combine the results of Proposition 3 and Proposition 2.

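Theorem 2. Let $C$ be a concept in SNF given by (1),
$\exists\forall\text{-completion}(C) = \exists T_1.E_1 \sqcap \cdots \sqcap \exists T_r.E_r \sqcap \forall U_1.F_1 \sqcap \cdots \sqcap \forall U_s.F_s$,
$\mathit{compute\text{-}ssnf}(C) := \exists\forall\text{-completion}(C) \sqcap \geq\text{-completion}(C)$, and
$D := \exists T_1.\mathit{compute\text{-}ssnf}(E_1) \sqcap \cdots \sqcap \exists T_r.\mathit{compute\text{-}ssnf}(E_r) \sqcap \forall U_1.\mathit{compute\text{-}ssnf}(F_1) \sqcap \cdots \sqcap \forall U_s.\mathit{compute\text{-}ssnf}(F_s)$, and

$\hat{C} := A \sqcap \hat{E} \sqcap \hat{F} \sqcap L \sqcap M \sqcap D$ with

$\hat{E} = \sqcap_{1 \leq i \leq n} \exists R_i.\mathit{compute\text{-}ssnf}(C_i)$,

$\hat{F} = \sqcap_{1 \leq i \leq m} \forall R'_i.\mathit{compute\text{-}ssnf}(C'_i)$.

Then, $\hat{C} \in \mathit{SSNF}(C)$.

Theorem 2 states that we obtain a concept $\hat{C}$ which is in SSNF w.r.t. $C$ if
we apply the procedure described in this section on every depth of $C$ and on
each $\exists$- and $\forall$-quantification in $\exists\forall$-completion($C$). In our initial example,
we get $\mathit{compute\text{-}ssnf}(Y) = \exists R.(A \sqcap B) \sqcap \forall R.(A \sqcap B) \sqcap (\geq 1\, R)$ and, thus,
for $C = Y$, $\hat{C} = C \sqcap \exists R.(A \sqcap B) \sqcap \forall R.(A \sqcap B) \sqcap (\geq 1\, R)$ as desired.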

In this section, we presented an algorithm to compute a concept
$\hat{C} \in \mathit{SSNF}(C)$. We will now use this result for computing the LCS of two concepts.

5 LCS Computation 

Given concepts $C$ and $D$ and $\hat{C} \in \mathit{SSNF}(C)$ and $\hat{D} \in \mathit{SSNF}(D)$, $\mathit{lcs}(C, D)$
can straightforwardly be implemented as an algorithm taking $\hat{C}$ and $\hat{D}$ as
arguments. Algorithm 2 recursively computes $\mathit{lcs}(C, D)$ with arguments $\hat{C}$
and $\hat{D}$.

Algorithm 2 compute-lcs($C$, $D$)

if $D \sqsubseteq C$ then $C$ else if $C \sqsubseteq D$ then $D$
else if ($C = A \vee C = \neg A$) and ($D = B \vee D = \neg B$) for atomic concepts $A$ and $B$
    then if $C = D$ then $C$ else $\top$
else if $C = \exists R.C'$ and $D = \exists S.D'$ then
    if $\mathit{msr}(R, S)$ is undefined then $\top$
    else $\exists\, \mathit{msr}(R, S)$.compute-lcs($C'$, $D'$)
else if $C = \forall R.C'$ and $D = \forall S.D'$ then $\forall (R \sqcap S)$.compute-lcs($C'$, $D'$)
else if $C = (\geq n\, R)$ and $D = (\geq m\, S)$ then
    if $\mathit{msr}(R, S)$ is undefined then $\top$
    else $(\geq \min\{n, m\}\ \mathit{msr}(R, S))$
else if $C = (\leq n\, R)$ and $D = (\leq m\, S)$ then $(\leq \max\{n, m\}\ (R \sqcap S))$
else if $C = C_1 \sqcap \cdots \sqcap C_n$ then $\sqcap_{1 \leq i \leq n}$ compute-lcs($C_i$, $D$)
else if $D = D_1 \sqcap \cdots \sqcap D_n$ then compute-lcs($D$, $C$)
else $\top$ endif

The invocations compute-lcs($\hat{X}_i$, $\hat{Y}$), $i \in \{1, 2, 3\}$, yield respectively
$\exists R.(A \sqcap B)$, $\forall R.(A \sqcap B)$, and $(\geq 1\, R)$ as desired.
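
A compact Python sketch of Algorithm 2 is given below for illustration. It assumes that both arguments are already in SSNF, uses the structural test of Algorithm 1 for the two subsumption checks, encodes concepts as tuples and role conjunctions as sets of atomic role names, and takes the most specific role of two roles to be the conjunction of their shared atomic roles. The helper names (`subsumes`, `mk_and`, `compute_lcs`) are hypothetical.

```python
def conj(c):
    """Top-level conjuncts of a concept."""
    return list(c[1:]) if c[0] == "and" else [c]


def subsumes(c, d):
    """Structural test of Algorithm 1: does c subsume d?"""
    if c == ("top",) or d == ("bot",):
        return True
    ds = conj(d)

    def covered(ci):
        k = ci[0]
        if k == "some":
            return any(dj[0] == k and ci[1] <= dj[1] and subsumes(ci[2], dj[2]) for dj in ds)
        if k == "all":
            return any(dj[0] == k and dj[1] <= ci[1] and subsumes(ci[2], dj[2]) for dj in ds)
        if k == "atleast":
            return any(dj[0] == k and ci[1] <= dj[1] and ci[2] <= dj[2] for dj in ds)
        if k == "atmost":
            return any(dj[0] == k and dj[1] <= ci[1] and dj[2] <= ci[2] for dj in ds)
        return k == "top" or ci in ds

    return all(covered(ci) for ci in conj(c))


def mk_and(parts):
    """Flattened conjunction without top and without duplicate conjuncts."""
    flat = []
    for p in parts:
        for q in conj(p):
            if q != ("top",) and q not in flat:
                flat.append(q)
    if not flat:
        return ("top",)
    return flat[0] if len(flat) == 1 else ("and", *flat)


def compute_lcs(c, d):
    """Least common subsumer of two concepts assumed to be in SSNF."""
    if subsumes(c, d):
        return c
    if subsumes(d, c):
        return d
    kc, kd = c[0], d[0]
    if kc in ("atom", "not") and kd in ("atom", "not"):
        return c if c == d else ("top",)
    if kc == kd == "some":
        r = c[1] & d[1]                      # msr: shared atomic roles
        return ("some", r, compute_lcs(c[2], d[2])) if r else ("top",)
    if kc == kd == "all":
        return ("all", c[1] | d[1], compute_lcs(c[2], d[2]))   # role R ⊓ S
    if kc == kd == "atleast":
        r = c[2] & d[2]
        return ("atleast", min(c[1], d[1]), r) if r else ("top",)
    if kc == kd == "atmost":
        return ("atmost", max(c[1], d[1]), c[2] | d[2])
    if kc == "and":
        return mk_and(compute_lcs(ci, d) for ci in c[1:])
    if kd == "and":
        return compute_lcs(d, c)
    return ("top",)


R = frozenset({"R"})
A, B, C = ("atom", "A"), ("atom", "B"), ("atom", "C")
X2, X3 = ("all", R, ("and", A, B)), ("atleast", 1, R)
Y_hat = ("and", ("some", R, A), ("some", R, B), ("atmost", 1, R),
         ("some", R, ("and", A, B)), X2, X3)
C1 = ("and", ("some", R, ("and", A, B)), ("atleast", 1, R))
D1 = ("and", ("some", R, ("and", A, C)), ("atleast", 1, R))
common = compute_lcs(C1, D1)
```

For the two hypothetical inputs `C1` and `D1`, which differ only in one atom below the existential quantification, the result keeps the shared atom and the shared number restriction.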

Theorem 3. Let $C$ and $D$ be concepts and $\hat{C} \in \mathit{SSNF}(C)$, and $\hat{D} \in \mathit{SSNF}(D)$.
Then compute-lcs($\hat{C}$, $\hat{D}$) returns a concept which is equivalent to $\mathit{lcs}(C, D)$.

Proof (sketch). By Theorem 1 we know that subsumption between $C$ and $D$ can
be computed structurally by applying Algorithm 1 to $\hat{C}$ and $\hat{D}$. Now Theorem 3
is a consequence of Theorem 3 in [3], which states that the LCS of concepts $C$
and $D$ can be computed according to Algorithm 2 if, for each language
constructor, a least upper bound operation on the arguments of the constructor is
present and the input concepts are in structural subsumption normal form. The
least upper bound operations are implicitly given by Algorithm 2.

6 Conclusion 

We have presented an LCS operator for the expressive DL $\mathcal{ALENR}$. Computing
the LCS of concepts is an important inference service with a number of
applications. This article contributes to recent research on extending the
LCS to more and more expressive DLs. As shown in the literature, the LCS can
be computed by a simple algorithm if, for each language constructor, a unique
least upper bound operation on the space of the arguments of this constructor
can be provided and the concepts are first transformed into a normal form with
which subsumption between the concepts can be decided by a structural
subsumption algorithm. The special challenge one faces is that interactions
between different concept constructors imply relevant implicit information that
must be made explicit. Based on the least upper bound operations, we extended
the LCS computation algorithm for the description logics $\mathcal{ALN}$ and
$\mathcal{ALE}$ to the new language constructs included in $\mathcal{ALENR}$. Future
research should include the extension of the LCS operation to more complex DLs.
Moreover, it would be interesting to extend the notion of concept commonalities
to description logics including a disjunction operator. The LCS operator does
not seem to be useful for this because the LCS of two concepts would just be
their disjunction, which we found not to express meaningful commonalities in
our similarity-based information retrieval application.

Abstract. One of the key issues in Automated Theorem Proving is
the search for optimal proof strategies. Since there is not one uniform
strategy which works optimally on all proof tasks, one is faced with the
difficult problem of selecting a good strategy for a given task. Strategy
parallelism, where a proof task is attempted in parallel by a set of
strategies with distributed resources, is a way of circumventing this strategy
selection problem. However, the problem of selecting the parallel
strategies and distributing the available resources among them still remains.
Therefore we have developed a method for automatic strategy and
resource configuration based on the combination of a genetic algorithm and
a gradient procedure. For the effective use of this method it is necessary
to be able to automatically gather large amounts of experimental data.
We present an environment for such large scale data collection that has
been used by us in preparation of the CADE-16 automatic system
competition. In order to evaluate the potential of the method experimentally,
we have implemented the strategy parallel theorem prover e-SETHEO.
The experimental results obtained with the system already justify our
approach while showing substantial potential for future development.



1 Introduction 

Automated Theorem Proving (ATP) is the subfield of computer science dealing
with the automatic verification of the validity of logical formulae. Attempting
to prove the validity of such formulae automatically, particularly beyond simple
textbook examples, typically results in a tremendously large search space. Such
a search problem is usually solved by a uniform search procedure. In automated
deduction, different search strategies may behave significantly differently on a
given problem. Unfortunately, in general, it cannot be decided in advance which
strategy is the best for a given problem. This motivates the competitive use
of different strategies, especially when the available resources are restricted. In
order to be successful with such an approach, the strategies must satisfy the
following two conditions. Sub-linearity: Let $\mathit{sol}(s, t)$ denote the set of problems
solved with a strategy $s$ in time $t$. Then, for a typical set of problems, the
function $t \mapsto |\mathit{sol}(s, t)|$ must be sub-linear, i.e., with each additional time
interval fewer new problems are solved. Complementarity: The competing
strategies must be complementary w.r.t. a given problem set, i.e., the sets of
problems solved in a certain time limit by two different strategies should differ
significantly. If both conditions are satisfied, then a competitive use of
different strategies can be more successful than the best single strategy.
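
The effect of the two conditions can be illustrated with a toy Python sketch. The solving times below are invented purely for illustration and are not measurements; the function names `sol` and `solved_by_schedule` and the data layout are likewise assumptions of this sketch.

```python
def sol(times, s, t):
    """Set of problems solved by strategy s within time t."""
    return {p for p, needed in times[s].items() if needed <= t}


def solved_by_schedule(times, schedule):
    """Problems solved when every strategy s runs for schedule[s] time units."""
    solved = set()
    for s, t in schedule.items():
        solved |= sol(times, s, t)
    return solved


# Hypothetical solving times (in seconds); a missing entry means "never solved".
times = {
    "s1": {"p1": 1, "p2": 2, "p3": 90},
    "s2": {"p3": 3, "p4": 4, "p1": 80},
}
budget = 10
best_single = max(len(sol(times, s, budget)) for s in times)
shared = solved_by_schedule(times, {"s1": budget / 2, "s2": budget / 2})
```

In this invented setting each strategy alone solves two problems within the full budget, while splitting the same budget evenly between the two complementary strategies solves all four.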

The selection of more than one search strategy in combination with tech- 
niques to partition the available resources in a manner dependent on the task 
defines the parallelization method strategy parallelism: distributed competitive 
agents attempt to solve the same problem, but with different strategies. It is 
intended that these strategies should traverse the search space such that, in 
practice, the repeated consideration of identical parts is largely avoided. In this 
paper we address some aspects of strategy parallelism, such as search space 
partitioning, strategy selection and resource allocation. We also describe the de- 
sign of a strategy parallel theorem prover and give some experimental results 
which justify our approach. According to our experiments, even simple forms of 
strategy parallelism can yield super-linear speedups. 

The paper is organized as follows. In the next section, we give a brief overview
of Automated Theorem Proving. Section 2 relates strategy parallelism to
other parallelization methods in automated deduction and discusses problems
like partitioning, schedule computation, and scalability. Furthermore, in this
section we give an outline of the design and the implementation of our strategy
parallel theorem prover e-SETHEO. Then, in Sections 3 and 4, we briefly describe
our prover configuration method. Section 5 presents some experimental results
obtained with this system. We conclude the paper with an outlook on future
development and an assessment of our current work.

2 Strategies and Strategy Parallelism, a Framework for a 
Strategy Parallel Prover 

For us, a strategy is one particular way of traversing the search space. We are 
now looking for a way of efficiently combining and applying different strategies in 
parallel. Many ways of organizing parallel computing have already been studied. 
However, many of these methods do not apply to automated theorem proving, 
since it is generally impossible to predict the size of each of the parallelized 
subproblems and it is therefore very hard to create an even workload distribution 
among the different agents. Here, we cite some of the successful examples. An 
example can be found in the nagging concept [SS94a]: dependent subtasks are 
sent by a master process to the naggers, which try to solve them and report 
on their success. The results are integrated into the main proof attempt. A 
combination of different strategies is used within the teamwork concept [Den95] 
of DISCOUNT [DKS97] for unit equality problems. These strategies periodically 
exchange intermediate results and work together evaluating these intermediate 
results and determining the further search strategies. A simple but effective 
combination of different theorem provers is applied in SSCPA [SS99]. Strategy 
selection techniques are applied even in systems with finite search spaces like 
EUREKA [CV98]. Partitioning of the search space [SS94b] is done, e.g., in
PARTHEO [SL90]. Some of these approaches are very good in certain aspects.
Partitioning, for example, can guarantee that no part of the search space is 
considered twice, therefore providing an optimal solution for the problem of 
generating “significantly” differing search strategies. The fundamental weakness 
of partitioning, however, is that it needs a tight and extensive communication 
between the agents. Therefore, we have investigated a competition approach. 
Different strategies are applied to the same problem and the first successful 
strategy stops all others. However, not all strategies are equally promising or 
require equal effort. It is therefore advisable to divide the available resources in 
an adequate way. 

The selection of more than one search strategy in combination with tech- 
niques to partition the available resources such as time and processors is called 
strategy parallelism [WL99]. Different, competitive agents traverse the same 
search space via different, ideally non-overlapping paths. Such a selection of 
strategies together with a resource allocation for the strategies is called a sched- 
ule. One of the key problems of strategy parallelism is the strategy allocation 
problem. We capture this problem by using a set of training examples from the 
given domain and optimizing the admissible strategies for this training set. The 
training phase, however, is extensive, as can be seen from the following consid- 
eration. Given a set of training problems, a set of usable strategies, a time limit, 
and a number of processors, we want to determine an optimal distribution of re- 
sources to each strategy, i.e., a combination of strategies which solves a maximal 
number of problems from the training set within the given resources. Unfor- 
tunately, even the single processor decision variant of this problem is strongly 
NP-complete [WL99]. In practice, we therefore use suboptimal schedules whose
generation will be described in Section 3. 
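
The notion of a schedule and its quality on a training set can be made concrete with a small sketch. It is only an illustration under stated assumptions: the single processor case, a schedule represented as a mapping from strategy names to time budgets, and a hypothetical table `TIMES` of solution times (the names `score`, `is_admissible`, and all data are ours, not part of e-SETHEO).

```python
from typing import Dict, Optional

# TIMES[problem][strategy] = seconds the strategy needs on that problem,
# or None if it fails. Hypothetical training data for illustration only.
TIMES: Dict[str, Dict[str, Optional[float]]] = {
    "P1": {"A": 5, "B": None},
    "P2": {"A": None, "B": 20},
    "P3": {"A": 90, "B": 150},
    "P4": {"A": None, "B": 40},
}

Schedule = Dict[str, float]  # strategy name -> time budget in seconds


def is_admissible(schedule: Schedule, time_limit: float) -> bool:
    """Single processor assumption: the budgets must fit into the time limit."""
    return sum(schedule.values()) <= time_limit


def score(schedule: Schedule, times: Dict[str, Dict[str, Optional[float]]]) -> int:
    """Number of problems solved by at least one strategy within its budget."""
    solved = 0
    for per_strategy in times.values():
        if any(
            t is not None and t <= schedule.get(name, 0.0)
            for name, t in per_strategy.items()
        ):
            solved += 1
    return solved


split = {"A": 100, "B": 200}      # two strategies sharing 300 seconds
only_a = {"A": 300}               # best use of strategy A alone
only_b = {"B": 300}               # best use of strategy B alone
```

On this toy data the split schedule solves all four problems, whereas either strategy alone solves fewer within the same 300 seconds. Finding the best such split for many strategies is the optimization problem referred to above.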

Using SETHEO as the basic underlying inference machine, we have developed 
p-SETHEO, a prototypical implementation of a strategy parallel theorem prover. 
This system in the meanwhile has been further developed into e-SETHEO, the 
most important improvements being the augmentation by the new E prover, a 
superposition calculus equality prover [Sch99] and the employment of PLOT- 
TER [WL99] as a conversion procedure from full first order logic to conjunctive 
normal form (CNF). The inclusion of E as an important strategy has cured 
SETHEO’s weakness in the equality domain. We have used e-SETHEO to col- 
lect experimental data and will participate with that system in the ATP system 
competition at the CADE-16 conference. Our implementation was initially
written in C, Perl and PVM, and currently consists of about 1500 lines of code
without the basic SETHEO prover. While in the early stages of development
nearly all strategies were variants of SETHEO (that is, e-SETHEO would
invoke different instantiations of SETHEO with different parameterizations), in
the meanwhile we have begun to incorporate a much wider variety of differ- 
ent strategies covering special problem classes. Often, certain characteristics of 
a proof task that are generally easy to recognize imply a special treatment of 
this task with a special strategy. For example, if the problem is either ground 
or can be grounded, it is possible to apply a propositional theorem prover to
that problem, which is not only very fast but additionally implements a decision
procedure for this kind of problem, allowing fast detection of non-theorems
as well. Watching the current development it can be said that e-SETHEO is 
moving from a parallel version of the SETHEO prover to a general platform for 
parallel theorem proving able to incorporate all state of the art ATP systems, 
provided these ATP systems adhere to minimum standards regarding issues such 
as resource allocation or input-output behavior. 




Fig. 1. Schematic view of the functionality of e-SETHEO. 



E-SETHEO performs its proof tasks in a number of distinguishable steps, as 
is depicted in Figure 1. The first step that is done upon invocation is problem 
analysis. The problem analysis provides e-SETHEO with the most important 
syntactic problem statistics such as the number of clauses and literals, the number
of variables or the number of equality predicates. From our experimental
experience it became evident that the greatest improvements in prover perfor- 
mance are not to be gained in the inference machine (which in itself has been 
refined and optimized over a period of several years), but in the analysis of the 
proof task and the subsequent choice of strategies. This first step also performs 
the CNF conversion if necessary. 

This analysis is the basis for step two of the proof task, the strategy allocation 
by schedule selection. A schedule is defined for that proof task according to 
criteria based on the problem statistics. This schedule is chosen from a schedule 
set generated during the prover configuration, which is explained in Section 3. 
Basically, a schedule is a list of strategies with resources assigned to each strategy.
Each strategy is a list of steps to be performed by the prover system. But as
certain subparts, such as basic preprocessing steps, can often be shared among 
the strategies that make up the schedule, the schedules can also be seen as 
a program execution tree or as a set of program execution trees. The selected 
schedule is then written to a configuration file. After the configuration, the final
step is executed: the actual parallel prover run. At that stage e-SETHEO hands
over control to a utility called WRAPPER that executes the selected schedule 
and watches the progress of the parallel subtasks, terminating the entire schedule 
when a strategy has been successful. 
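
The competitive execution of a schedule can be sketched as follows. This is a simulation only: threads stand in for prover processes, and `make_strategy` and `run_schedule` are hypothetical names of ours that do not reflect the actual interface of WRAPPER.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional, Tuple

Strategy = Callable[[threading.Event, float], Optional[str]]


def make_strategy(name: str, seconds_needed: float, succeeds: bool) -> Strategy:
    """Build a simulated strategy that 'works' for a while, then reports."""
    def run(stop: threading.Event, budget: float) -> Optional[str]:
        deadline = time.monotonic() + min(seconds_needed, budget)
        while time.monotonic() < deadline:
            if stop.is_set():          # another strategy already succeeded
                return None
            time.sleep(0.005)
        # A proof is reported only if the strategy fits into its budget.
        return name if succeeds and seconds_needed <= budget else None
    return run


def run_schedule(schedule: List[Tuple[Strategy, float]]) -> Optional[str]:
    """Run all strategies concurrently; the first success stops the others."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=len(schedule)) as pool:
        futures = [pool.submit(strategy, stop, budget) for strategy, budget in schedule]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                stop.set()             # signal the remaining strategies to quit
                return result
    return None
```

For example, `run_schedule([(make_strategy("A", 0.5, True), 1.0), (make_strategy("B", 0.05, True), 1.0)])` returns `"B"`, since B reports first and A is then told to stop.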



3 Prover Configuration 

The multitude of settings of the basic SETHEO inference machine and the con- 
siderable number of additional prover tools employed by e-SETHEO result in a 
vast number of different configurations in which the prover system can be used. 
It is obvious that it is not feasible to test all these possible configurations for 
their performance on a given problem domain. However, using heuristics, intu- 
ition and experience, a number of about one hundred of these configurations has 
been identified as potentially useful and implemented as strategies. Still, having
a hundred different strategies to choose from (and to distribute resources
among), trying to obtain an optimal solution for resource allocation would be 
futile. Therefore we choose our schedule from a number of pseudo-optimal solu- 
tions we acquire in a three-phased process. 

In the first phase, all given problems are divided into a small number of 
classes according to some very simple discrimination criteria, such as Horn or 
non-Horn formulae with or without equality. In the second phase, for each of 
these classes a set of schedules is evaluated by the genetic algorithm described 
in [SW99]. The best of these schedules are selected for refinement in the third 
phase by applying the gradient method explained in [Wol98b]. This process re- 
sults in a set of pseudo-optimal schedules, one for each class, that are then used 
for configuring e-SETHEO. It can easily be seen that this entire configuration 
process can be done automatically, the one task remaining for the user being the 
selection of a suitable subset of the problem domain to be used for training. The 
experiments described in the next section showed among other things that the 
genetic algorithm generates decent results in a very short time. Several refine- 
ments of the process could only slightly enhance the results, but could ensure 
a better convergence of these results. This is a direct consequence of the first 
required strategy property as postulated in Section 1. 
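
The second and third phase can be sketched generically as a genetic search over schedules followed by a greedy local refinement. The code below is not the algorithm of [SW99] or the gradient method of [Wol98b], whose details are not reproduced here; the crossover, mutation, and refinement operators, the function names, and the toy data are all assumptions made for illustration. Only the kill-off rate of 0.6 and the per-strategy mutation probability of 0.2 follow the parameters reported for our experiments.

```python
import random
from typing import List, Optional

Times = List[List[Optional[int]]]  # times[p][s]: seconds strategy s needs on problem p


def fitness(schedule: List[int], times: Times) -> int:
    """Problems solved by some strategy within its share of the time."""
    return sum(
        any(t is not None and t <= budget for t, budget in zip(row, schedule))
        for row in times
    )


def normalise(schedule: List[int], limit: int) -> List[int]:
    """Scale budgets down so that they sum to at most `limit`."""
    total = sum(schedule)
    return schedule if total <= limit else [b * limit // total for b in schedule]


def random_schedule(n: int, limit: int, rng: random.Random) -> List[int]:
    cuts = sorted(rng.randint(0, limit) for _ in range(n - 1))
    return [b - a for a, b in zip([0] + cuts, cuts + [limit])]


def crossover(a: List[int], b: List[int], limit: int, rng: random.Random) -> List[int]:
    return normalise([rng.choice(pair) for pair in zip(a, b)], limit)


def mutate(s: List[int], limit: int, rng: random.Random, rate: float = 0.2) -> List[int]:
    return normalise([rng.randint(0, limit) if rng.random() < rate else b for b in s], limit)


def genetic(times: Times, limit: int, rng: random.Random,
            size: int = 20, generations: int = 30, kill_off: float = 0.6) -> List[int]:
    n = len(times[0])
    pop = [random_schedule(n, limit, rng) for _ in range(size)]
    keep = max(2, int(size * (1 - kill_off)))
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, times), reverse=True)
        survivors = pop[:keep]                      # the worst 60% are killed off
        children = [
            mutate(crossover(*rng.sample(survivors, 2), limit, rng), limit, rng)
            for _ in range(size - keep)
        ]
        pop = survivors + children
    return max(pop, key=lambda s: fitness(s, times))


def refine(schedule: List[int], times: Times, step: int = 5) -> List[int]:
    """Greedy refinement: move `step` seconds between strategies while it helps."""
    best, improved = list(schedule), True
    while improved:
        improved = False
        for i in range(len(best)):
            for j in range(len(best)):
                if i == j or best[i] < step:
                    continue
                cand = list(best)
                cand[i] -= step
                cand[j] += step
                if fitness(cand, times) > fitness(best, times):
                    best, improved = cand, True
    return best


# Hypothetical training data: 5 problems, 3 strategies.
TIMES: Times = [[5, None, None], [None, 20, None], [90, 150, None],
                [None, 40, 10], [None, None, 60]]
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next, and the refinement step only accepts strict improvements.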



4 Data Generation 



We capture the schedule determination problem by using a set of training
examples from the given domain and optimizing the admissible strategies for this
training set using the genetic gradient method^. Providing the necessary training
data, however, is a very extensive task. Given a set of training problems, a set 
of usable strategies, a time limit t, and a number of processors, we want to de- 
termine a combination of strategies which solves a maximal number of problems 
from the training set within the given resources. To compute this combination, 
we have to determine for all admissible strategies S the solution times (within t) 
on all problems P from the training set. In our experiment, we used 50
workstations for several weeks of CPU time to determine all necessary data on a
training set of about 4000 problems and a set of 110 strategies. The worksta- 
tions were organized in a loosely connected cluster with shared disk memory. 
Additionally, we used a controller workstation separated from the cluster. 

We do not have exclusive access to the workstation cluster, and because of 
the large amount of work to be done and because of the (in general) consider- 
able resources required by theorem prover programs, the data generation system 
needs to have some means of balancing and limiting the load produced by our 
experimental setup, so the interactive users of the workstation cluster and the 
connecting network will not be needlessly encumbered. Therefore we employ a
Performance Evaluator. This Performance Evaluator is responsible for the gen- 
eration and maintenance of a data base containing all data necessary for the 
evaluation of the expected performance and the expected free resources on all 
involved machines. We limit the number of prover processes running simulta- 
neously on the same processor as well as the maximum load and the minimum 
amount of free memory allowed before starting a new prover task. 

The Task Generator maintains a list of tasks to be treated. Using inquiries on
the available strategies, on the available problems, and on the strategy-problem
pairs which have already been finished properly, the Task Generator generates a
to-do list which is given to the Task Scheduler component. The Task Scheduler
launches the tasks from the to-do list as soon as the Performance Evaluator
provides usable hosts. If a task finishes correctly, i.e., without errors
caused by the operating system or other users, this fact and the generated data are
recorded and used both for generating the data matrices required by the genetic
gradient algorithm and for providing the re-entry points needed when the
whole data generation system has to be restarted, e.g., after re-booting the
controller workstation or after a general network failure. An abstract view of the data
generation system can be seen in Figure 2.
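
The interplay of the three components can be illustrated by a minimal sketch. The data structure `HostStatus`, the function names, and the threshold values are hypothetical and chosen by us for illustration; they are not taken from the actual data generation system.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, List, Set, Tuple

Task = Tuple[str, str]  # (strategy, problem)


@dataclass
class HostStatus:
    """What a performance evaluator might record per workstation (assumed fields)."""
    name: str
    prover_processes: int
    load: float
    free_memory_mb: int


def todo_list(strategies: Iterable[str], problems: Iterable[str],
              finished: Set[Task]) -> List[Task]:
    """All strategy-problem pairs that have not yet been finished properly."""
    return [task for task in product(strategies, problems) if task not in finished]


def usable(host: HostStatus, max_processes: int = 1,
           max_load: float = 1.0, min_free_mb: int = 64) -> bool:
    """Admit a new prover task only if the host is below all three limits."""
    return (host.prover_processes < max_processes
            and host.load <= max_load
            and host.free_memory_mb >= min_free_mb)


def dispatch(todo: List[Task], hosts: List[HostStatus]) -> List[Tuple[Task, HostStatus]]:
    """Pair pending tasks with the currently usable hosts, one task per host."""
    return list(zip(todo, [h for h in hosts if usable(h)]))
```

Since the to-do list is recomputed from the set of finished pairs, restarting the system after a failure simply resumes with the pairs that are still missing.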

5 Experimental Results 

Our experiments were conducted in two phases. First we intended to verify the 
feasibility of the approach described in Section 3 with reduced problem and 
strategy sets. Therefore we used the 547 eligible TPTP problems [SSY94] of the 
theorem prover competition at the 15th Conference on Automated Deduction in 
1998 to be our training data set. Our participating prover system p-SETHEO
[Wol98a] employed 91 different strategies; these formed our strategy set. We
extracted all these strategies and ran each strategy on all problems using the
standard sequential SETHEO [MIL+97]. The successful results of all those runs
were collected in a single list that became the database for our genetic gradient
algorithm. 398 problems can be solved by at least one of the strategies in at most
300 seconds. Then we ran the genetic gradient algorithm on the collected data.
The success of each of the schedules, as the individuals of our genetic algorithm,
was evaluated by looking up the list entries for the problems and respective
strategies and time resources.

^ The genetic gradient algorithm is the genetic algorithm followed by the application
of the gradient procedure to the best individuals.

Fig. 2. Schematic view of the data generation system.

In all experiments we used the gradient procedure and the genetic algorithm^
described above. The attributes of the initial generation, which are selected at
random, strongly influence the overall results of the experiment. The deficiencies of
an unfit initial generation cannot be wholly remedied by the subsequent
optimizations. Therefore all experiments were repeated at least ten times. The
curves and tables depicted in this section represent the median results. Figure 3
shows the number of problems solved after 0 to 100 generations for 10, 20, 40,
and 160 individuals (numbers at the curves) in 300 seconds on a single processor
system. Figure 4 displays the number of problems solved on
a single processor system with 100 individuals after 0, 10, and 50 generations
(numbers at the curves) in the time interval from 0 to 1000 seconds. The behavior
is compared with the best single strategy (denoted by bs). Note that
the strategy parallel system proved 378 problems within 1000 seconds, the best
single strategy only 214, that is 57% of the strategy parallel version. These 214
problems have already been solved by the strategy parallel system after 25 (!)
seconds. Figure 5 illustrates the number of problems solved for 100 individuals
and 0 to 100 generations with a timeout value of 1000 seconds on systems with
1, 2, 4, and 8 processors (numbers at the curves). Finally, we compare the results
of both (single) methods, the gradient method and the genetic algorithm, on 1,
2, 4, and 8 processors with 300 seconds each (see Table 1). We see the slightly
better results of the gradient procedure on one processor. In all other cases, the
genetic algorithm performs better.

^ The fixed parameters have been a kill-off rate of 0.6, a mutation rate of 0.1, and a
mutation probability for each strategy of 0.2.

Fig. 3. Number of problems solved by the schedule resulting from the genetic
algorithm depending on the number of generations for populations of 10, 20, 40,
and 160 individuals.

Fig. 4. Number of problems solved by the schedule resulting from the genetic
algorithm depending on the consumed time on a single processor system with
100 individuals after 0, 10, and 50 generations, compared with the best single
strategy (bs).

Fig. 5. Number of problems solved by the schedule resulting from the genetic
algorithm depending on the number of generations for populations of 100
individuals each on systems with 1, 2, 4, and 8 processors and a time limit of 1000
seconds.

                                 processors
                                 1      2      4      8
gradient procedure (solutions)   355    361    374    382
genetic algorithm (solutions)    352    369    381    388

Table 1. Solutions found by the gradient and the genetic method on systems
with 1, 2, 4, and 8 processors.

Our experimental results show only poor scalability for our current prover
system. This is due to the very limited number of training problems. Furthermore,
many of the strategies used overlap one another (see [WL99]). Still, all the
above figures indicate that the genetic gradient approach is extremely useful for
automatically configuring a strategy parallel theorem prover, and therefore we 
started the data generation for the entire problem and strategy sets. We tested 
each of the 110 strategies on all 4004 problems of the latest version of the TPTP 
problem library [SSY94]. These test runs were conducted on the workstation 
cluster as described in Section 4 with a time limit of 300 seconds per problem
and strategy and took about two months; the results of these tests can be seen
in the second and third columns of Table 2. Using the data from these runs we
generated a set of pseudo-optimal schedules for the different problem classes.
After having configured e-SETHEO with these schedules, we tested e-SETHEO
on the TPTP problems again; the number of problems solved by the strategy
parallel prover is shown in the fourth column of Table 2, with the fifth column
giving the number of strategies contained in the respective schedules. Even if the
relative increases gained by our process may not seem excessively high, it has 
to be noted that the TPTP library as a whole contains many trivial problems 
as well which can also be solved by brute force strategies in very short time. 
Additionally, as has been explained in Section 1, due to the highly sub-linear 
nature of strategy performance over time, large gains in the number of problems 
solved become ever more unlikely with increasing problem difficulty. So the re- 
sults obtained by our automatic prover configuration can actually be considered 
a success. 



Problem Class              # problems  # solved   # solved   # solved     # strategies
                           in class    by some    by best    by schedule  in schedule
                                       strategy   strategy

Groundable                    753        685        410        682           4
Unit Equality                 446        365        330        341           5
Pure Equality                 132         96         85         93           6
Horn w. Equality              226        183        175        183           3
Horn w/o Equality             373        293        275        291           3
non-Horn w/o Eq. (large)      268         98         74         96           6
non-Horn w/o Eq. (small)      266        227        191        227           6
non-Horn w. Eq. (large)       841        140         78        132           9
non-Horn w. Eq. (small)       699        445        317        416          12
TOTAL                        4004       2532       1935       2461           -

Table 2. Results in number of proofs found for TPTP v2.2.0 (1 processor,
300 seconds)
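
The figures of Table 2 can be checked and compared with a few lines of arithmetic. The sketch below copies the per-class values from the table; the names `TABLE2`, `column_totals`, and `gains` are ours. Note that the total of the third column computed here is the sum of the per-class best-strategy counts.

```python
from typing import Dict, Tuple

# class -> (problems in class, solved by some strategy,
#           solved by best strategy, solved by schedule)
TABLE2: Dict[str, Tuple[int, int, int, int]] = {
    "Groundable":               (753, 685, 410, 682),
    "Unit Equality":            (446, 365, 330, 341),
    "Pure Equality":            (132,  96,  85,  93),
    "Horn w. Equality":         (226, 183, 175, 183),
    "Horn w/o Equality":        (373, 293, 275, 291),
    "non-Horn w/o Eq. (large)": (268,  98,  74,  96),
    "non-Horn w/o Eq. (small)": (266, 227, 191, 227),
    "non-Horn w. Eq. (large)":  (841, 140,  78, 132),
    "non-Horn w. Eq. (small)":  (699, 445, 317, 416),
}


def column_totals(table: Dict[str, Tuple[int, int, int, int]]) -> Tuple[int, ...]:
    """Sum each of the four numeric columns over all problem classes."""
    return tuple(sum(column) for column in zip(*table.values()))


def gains(table: Dict[str, Tuple[int, int, int, int]]) -> Dict[str, int]:
    """Additional problems solved by the schedule over the best single strategy."""
    return {name: schedule - best for name, (_, _, best, schedule) in table.items()}


totals = column_totals(TABLE2)
coverage = totals[3] / totals[1]   # share of solvable problems the schedules reach
```

The schedules recover about 97% of the problems that any of the 110 strategies can solve within 300 seconds (2461 of 2532), and the largest absolute gain over the best single strategy occurs in the groundable class (272 problems).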



6 Future Work: Beyond Syntactic Problem Classification 

In this section we want to address an issue that has not (at least up to now) been 
sufficiently treated by ATP researchers. Strategy parallelism has eliminated the
necessity to choose a suitable search strategy in advance. Yet, given the large 
number of different strategies or even only the comparably small number of dif- 
ferent schedules, it is still necessary to choose the proper way of solving the proof 
problem. In e-SETHEO, this choice is based solely on the syntactic characteristics
of the problem clauses. In many cases this is a very efficient approach,
as in the case of groundable formulae, where the absence of function symbols
strongly suggests grounding the formula and subsequently using a semantic tree 
procedure. Yet, there remain large formula classes without an apparent inner 
structure implied by syntactic characteristics. A typical example for such a for- 
mula class is the class of non-Horn problems with equality, the class containing 
the most general problems available. Different attempts to divide this class, such 
as according to the specific number of function symbols or predicates, have all 
failed as it turned out that syntactically very similar problems caused widely 
differing search space behavior. 

As the next step in our research we intend to tackle this problem using a 
genetic algorithm. Given a set of schedules (such as the ones we are currently 
using), we employ such a genetic algorithm to define a schedule selection function 
using the syntactic data as input. Such a function would be a weighted sum over
the syntactic values, with the coefficients of the sub-terms forming the chromosomes. A
set of such chromosomes representing the function as a whole would constitute an 
individual. Given these individuals, we can use the genetic algorithm to produce 
a function that provides a nearly optimal problem classification. 
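
One possible reading of such a selection function is sketched below, purely as an assumption on our part: every schedule owns one vector of coefficients (a chromosome), the score of a schedule is the weighted sum of the syntactic feature values, and the schedule with the highest score is selected. The feature choice, the schedule names, and the weights are hypothetical.

```python
from typing import Dict, List


def select_schedule(features: List[float], weights: Dict[str, List[float]]) -> str:
    """Return the schedule whose weighted sum over the features is largest."""
    scores = {
        name: sum(w * f for w, f in zip(coefficients, features))
        for name, coefficients in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical features: [clauses, literals, variables, equality predicates],
# scaled to comparable ranges beforehand.
WEIGHTS = {
    "equality_heavy": [0.1, 0.0, 0.2, 2.0],
    "general":        [0.5, 0.5, 0.5, 0.0],
}
```

A genetic algorithm would then tune the coefficient vectors so that, on the training set, the selected schedule solves as many problems as possible.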

7 Conclusion 

In this paper we have demonstrated how the performance of automated theorem 
provers can be improved by the introduction of strategy parallelism. Our exper- 
iments demonstrate that it is possible to significantly increase this performance 
even with a simple strategy allocation algorithm and a non-optimized set of 
available strategies. While in theorem proving the system developer or advanced 
user often can tune the system by a suitable selection of parameters, this is not 
possible if the theorem prover is to be integrated into a larger proof environ- 
ment like ILF. In this case the configuration of the theorem prover must be done 
automatically and strategy parallelism is a good solution of this problem. 

It can be said that our implementation of e-SETHEO has moved away from 
a parallel framework for SETHEO towards a generic tool for parallel theorem 
proving, able to incorporate practically any state-of-the-art theorem prover. 

In this paper we did not address a variety of issues of considerable impor- 
tance that will be the subject of future research: Often, strategies are successful 
for a certain problem class, like the use of a special prover for unit equality or 
ground problems. The identification of such features can make the selection of 
strategies more specific and hence more successful. Currently we identify such 
features by purely syntactic means. Can we advance to an improved way of 
feature detection that involves semantic analysis? The success of the selected 
strategies depends on the training set of problems used for learning about the 
efficiency of strategies. How do we obtain a training set which is representative 
for the considered domain of problems? The number of sensible strategies for 
SETHEO that are successful and maximally orthogonal seems to be bounded.
This restricts the scalability to large platforms of parallel processors. Can we 
find a systematic method for producing an arbitrary number of successful and
orthogonal strategies? It is very likely that in order to overcome this problem
random elements such as the ones mentioned in Section 2 have to be employed. 
Furthermore, possible improvements of the strategy allocation algorithm should 
be investigated. Up to now e-SETHEO employs a non-communicating variety of 
cooperation. Yet it can be imagined that the strategies periodically report on 
their proof status, inference rate, host processor performance etc., information 
that could be used for online strategy evaluation. This might allow the replace- 
ment of badly performing strategies or the migration of strategies from badly 
performing processors. 

Finally, we have to face the objection that e-SETHEO is excessively tuned 
to perform well on the TPTP, a collection of mostly very theoretical problems. 
We admit to that. We think, however, that this is not the point. If ATP systems 
want to be of practical use in fields such as verification, then the adaptation of
the prover system to the respective problem domain will be a basic necessity. 
We have shown our ability to adapt in such a way in the case of the TPTP, and 
as our tuning and configuration mechanism is entirely generic, we are optimistic 
that we are able to adapt our system to perform reasonably well on any problem 
domain.

Abstract. Using semantics to guide automated theorem proving systems
is an under-utilised technique. In linear deduction, semantic guidance
has received only limited attention. This research is developing semantic
guidance for linear deduction in the Model Elimination paradigm.
Search pruning, at the possible loss of some refutation completeness,
and search guidance, are being considered. This paper describes
PTTP+GLiDeS, a PTTP style prover augmented with a semantic pruning
mechanism, GLiDeS. PTTP+GLiDeS combines a modified version
of Stickel’s PTTP prover with the model generator MACE.



1 Introduction 

Automated theorem proving (ATP) aims to use computer technology to solve 
problems that require logical reasoning. Applications for ATP systems include 
logic circuit design validation, software verification, mathematical and logical 
research [18]. 

Resolution [9] was developed in 1965 and has formed the basis of much of the 
research undertaken in the field since. Resolution uses ‘proof by contradiction’ 
in its search for a proof. Assumptions (axioms) about the problem and the 
negated conjecture are expressed in the clause normal form of first order logic. 
The naive resolution approach takes the input clause set, S_0, and generates the
set of all possible resolvents, R_0. If the empty clause is a member of R_0, a
contradiction has been found and the problem solved. Otherwise, a new set S_1
is created, S_1 = S_0 ∪ R_0, and the process continues. If set R_n contains the empty
clause, S_n forms the search space for a minimal length proof. A large search
space means the time taken to find a proof can be long. In order to find proofs 
quickly, both the size of the search space and the path the prover takes through 
the search space need to be controlled in an intelligent manner.
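
The level-by-level saturation just described can be sketched for the propositional case. The encoding (clauses as frozensets of non-zero integers, a negative integer denoting a negated atom) and the function names are our own choices for illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Optional, Set

Clause = FrozenSet[int]  # e.g. frozenset({1, -2}) stands for p V ~q


def resolvents(c1: Clause, c2: Clause) -> Set[Clause]:
    """All binary resolvents of two propositional clauses."""
    return {(c1 - {lit}) | (c2 - {-lit}) for lit in c1 if -lit in c2}


def saturate(clauses: Iterable[Clause], max_rounds: int = 10) -> Optional[int]:
    """Return the level n at which R_n contains the empty clause, else None."""
    s: Set[Clause] = set(clauses)              # S_0
    for n in range(max_rounds):
        r: Set[Clause] = set()                 # R_n: all resolvents of S_n
        for c1, c2 in combinations(s, 2):
            r |= resolvents(c1, c2)
        if frozenset() in r:
            return n                           # contradiction found
        if r <= s:
            return None                        # saturated, no refutation
        s |= r                                 # S_{n+1} = S_n ∪ R_n
    return None


# The unsatisfiable set { p V q, p V ~q, ~p V q, ~p V ~q } with p = 1, q = 2.
EXAMPLE = [frozenset(c) for c in ({1, 2}, {1, -2}, {-1, 2}, {-1, -2})]
```

For `EXAMPLE`, the unit clauses p, q, ~p and ~q appear in R_0, and the empty clause appears in R_1. Even in this tiny case most of the generated resolvents are irrelevant to the proof, which is why ordering and pruning strategies matter.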

To control the search of a resolution based system, ordering and pruning 
strategies are used. Ordering strategies control the order in which resolvents are 
generated by giving preference to certain clauses and literals. Pruning strategies 
prevent certain combinations of clauses and also discard clauses, preventing them 
from taking any further part in the deduction. While ordering strategies attempt 
to guide the search along paths that may be more likely to produce the empty 
clause, pruning strategies reduce the search space. 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 244-254, 1999. 

(c) Springer-Verlag Berlin Heidelberg 1999

Search control may utilise syntactic or semantic methods. Syntactic methods
use some physical feature of the clauses to determine which clauses will be re- 
solved together and on which literals. Semantic methods use interpretations to 
give information about the clauses. This information is then used in choosing the 
clauses and literals to resolve. Semantic methods have the potential to perform 
much better than syntactic ones [16]. Semantic search control for forward chain- 
ing resolution strategies has been in use for some time. Set of support (SoS) [17], 
model resolution [6], and semantic resolution [10] are all forward chaining resolution
strategies. Two systems that employ semantic guidance are CLIN-S [2] and
SCOTT [11]. SCOTT is a resolution based prover that uses an interpretation to
weight clauses and thus give preference to those clauses that are FALSE in the
interpretation. CLIN-S is an instantiation based prover, and uses an interpretation
to guide the generation of ground clauses, which are then examined for
unsatisfiability.

Incorporating semantic methods into backward chaining resolution strategies 
is not as easy as for forward chaining ones. Semantic guidance for linear-input 
resolution is well understood [1], as described in Section 3. This research aims 
to incorporate semantic guidance into general linear resolution [5,7], primarily 
considering pruning strategies. 

The next section contains a brief explanation of the Model Elimination (ME) 
paradigm and introduces some terminology. Section 3 describes the architecture 
of PTTP+GLiDeS and explains the way in which semantic guidance has been
incorporated into the ME based system. Implementation and performance are 
discussed in Sections 4 and 5 respectively. Further enhancements are being ex- 
plored and these are outlined in Section 6. 

2 Model Elimination 

ME is a chain format linear resolution procedure for first order logic, first proposed
in [4]. A chain is an ordered list of A- and B-literals, with the disjunction
between the literals being implicit. Chains generated from the input clauses are 
called input chains and are composed entirely of B-literals. The chains that form 
the linear path in the refutation are called the centre chains. The input chains 
that are resolved with the centre chains are called the side chains. A-literals 
are those literals in a centre chain that have been resolved upon. A-literals are 
indicated by a frame. The first centre chain is called the top chain. One
of the input chains is chosen to be the top chain; a chain generated from the 
negated conjecture is the usual choice. All input chains are potential side chains. 

In ME there are two deduction operations, extension and reduction, and one 
book-keeping operation, truncation. The extension operation is a binary resolu- 
tion between a centre chain and a side chain. The resolution takes place between 
the rightmost B-literal in the centre chain and a complementary (after unification)
B-literal in the side chain. The B-literal in the centre chain then becomes
an A-literal, and the B-literal in the side chain is removed. The remaining
B-literals in the side chain are added to the right of the newly created A-literal in
the centre chain. A reduction operation is a unification between the rightmost
B-literal in the centre chain and an A-literal. The new centre chain is formed by
removing the B-literal. Reduction implements ancestor resolution and factoring.
Truncation is the removal of A-literals from the right-hand end of a centre chain.
See Figure 1 for an example of an ME refutation. 

[Figure: the sequence of centre chains of the refutation, annotated with the
operations applied: extension, extension, reduction, truncation, truncation,
extension, extension, reduction, truncation x 2.]

Fig. 1. An ME refutation of the set { p V q, p V ~q, ~p V q, ~p V ~q }
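
The three ME operations can be tried out on this clause set with a small propositional sketch. It is an illustration only: literals are encoded as integers (p = 1, q = 2, negation by sign), a chain is a list of pairs (literal, is A-literal), reduction is restricted to the propositional case where the B-literal is the complement of an A-literal, and the search is a depth-bounded depth-first search with iterative deepening. The function names are ours.

```python
from typing import List, Optional, Sequence, Tuple

Chain = List[Tuple[int, bool]]  # (literal, True if A-literal)


def refute(chain: Chain, clauses: Sequence[Sequence[int]], depth: int) -> bool:
    """Try to derive the empty chain with at most `depth` extensions/reductions."""
    # Truncation: drop A-literals from the right-hand end of the chain.
    while chain and chain[-1][1]:
        chain = chain[:-1]
    if not chain:
        return True
    if depth == 0:
        return False
    lit = chain[-1][0]                       # rightmost B-literal
    # Reduction: the B-literal is complementary to some A-literal.
    if any(is_a and a == -lit for a, is_a in chain[:-1]):
        if refute(chain[:-1], clauses, depth - 1):
            return True
    # Extension: resolve against a side chain containing the complement.
    for clause in clauses:
        if -lit in clause:
            rest = [(other, False) for other in clause if other != -lit]
            if refute(chain[:-1] + [(lit, True)] + rest, clauses, depth - 1):
                return True
    return False


def prove(top: Sequence[int], clauses: Sequence[Sequence[int]],
          max_depth: int = 10) -> Optional[int]:
    """Iterative deepening; returns the smallest depth bound that suffices."""
    start: Chain = [(lit, False) for lit in top]
    for depth in range(max_depth + 1):
        if refute(start, clauses, depth):
            return depth
    return None


CLAUSES = [(1, 2), (1, -2), (-1, 2), (-1, -2)]   # the clause set of Fig. 1
```

With the top chain p q, `prove((1, 2), CLAUSES)` succeeds at depth bound 6, which corresponds to the four extensions and two reductions listed for the refutation in Fig. 1 (truncations are not counted).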



One method of implementing ME is using the Prolog Technology Theorem 
Prover (PTTP) [12] principle. The idea here is to have the theorem prover rewrite 
the input clause set into Prolog procedures that implement ME deduction for 
the clauses. The procedures are then compiled and executed on a Prolog engine 
(see Figure 2). Prolog is based on linear-input deduction for Horn clauses and 
has an incomplete search strategy and unsound unification algorithm. A PTTP 
style system overcomes these issues by using a bounded depth first search and 
unification with an occurs check.

Fig. 2. Architecture of PTTP-based ATP systems
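
The role of the occurs check mentioned above can be illustrated by a small unification routine. The term representation is an assumption made for this sketch: variables are strings starting with an upper-case letter, constants are other strings, and compound terms are tuples whose first element is the functor.

```python
from typing import Dict, Optional, Tuple, Union

Term = Union[str, Tuple]          # "X", "a", or ("f", arg1, arg2, ...)
Subst = Dict[str, Term]


def is_var(t: Term) -> bool:
    return isinstance(t, str) and t[:1].isupper()


def walk(t: Term, subst: Subst) -> Term:
    """Follow variable bindings until an unbound variable or non-variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t


def occurs(v: str, t: Term, subst: Subst) -> bool:
    """True if variable v occurs in term t under the current bindings."""
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, arg, subst) for arg in t[1:])


def unify(a: Term, b: Term, subst: Optional[Subst] = None) -> Optional[Subst]:
    """Return a unifying substitution, or None if the terms do not unify."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        # Occurs check: refuse to bind X to a term containing X.
        return None if occurs(a, b, subst) else {**subst, a: b}
    if is_var(b):
        return unify(b, a, subst)
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None
```

Without the occurs check, `unify("X", ("f", "X"))` would bind X to f(X) and thereby accept a unifier that does not exist, which is the source of unsoundness in standard Prolog unification. With the check the call returns `None`.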



3 Architecture 

The PTTP+GLiDeS semantic pruning strategy is based upon the strategy that 
can be imposed on linear-input deductions, as follows: If there exists a linear- 
input refutation, the last centre clause is the empty clause. The empty clause has 
the interpretation of FALSE in every interpretation. A FALSE resolvent must 
have one or more FALSE parents. If there is a model M of the side clauses, 
then this implies that the second last centre clause must be FALSE in M, and 
so on up to the top clause. So, if the side clauses are known and a model of 
them, M, can be found, then any centre clause that is TRUE in M can be 
rejected. A simple possibility is to choose a negative top clause from a set of Horn 
clauses, in which case the non-negative clauses are the potential side clauses. 
More sensitive analysis is also possible [3,14]. Linear-input resolution is complete 
for Horn clauses only. 

Unfortunately, the extension of the linear-input semantic pruning strategy to 
linear deduction is not direct. For the non-Horn case, ancestor resolution is re- 
quired for refutation-completeness. The possibility of ancestor resolutions means 
that centre clauses may be TRUE in a model of the side clauses. Investigation of 
how to allow for centre clauses that are TRUE in the model of the side clauses 
is a focus of this research. 

In PTTP+GLiDeS, rather than placing a constraint on entire centre clauses, 
a semantic constraint is placed on selected literals of the centre clauses as follows: 
The input clauses other than the chosen top clause of a linear deduction are 
named the model clauses. In a completed linear refutation, all centre clause 
literals that have resolved against input clause literals are required to be FALSE 
in a model of the model clauses. TRUE centre clause literals must be resolved 
against ancestor clause literals. This leads to a semantic pruning strategy for 
ME deductions that at every stage requires all A-literals in the deduction so far 
to be FALSE in a model of the model clauses. The result is that only FALSE 
B-literals are extended upon, and TRUE B-literals must reduce.

The completeness of the PTTP+GLiDeS semantic pruning strategy has not
yet been investigated. It is certainly possible that it is an incomplete strategy. 
However, the results shown in Section 5 suggest that there is not a ‘large loss of 
completeness’, while the benefits are significant. 

Figure 3 shows the architecture of the PTTP+GLiDeS system. 
PTTP+GLiDeS uses a Prolog technology theorem prover to compile the input 
clauses into Prolog code, which is then run on a Prolog engine. An interpretation 
generator takes the model clauses from the input clause set and generates an 
interpretation which is also given to the Prolog engine. The Prolog code uses the 
interpretation to implement the semantic guidance. 



[Figure 3: block diagram. The input clauses feed both the interpretation generator and PTTP+GLiDeS; the resulting interpretation and the compiled Prolog clauses are passed to the Prolog engine.]

Fig. 3. Architecture of the PTTP+GLiDeS system.

4 Implementation 

PTTP+GLiDeS consists of modified versions of PTTP (Prolog version 2e) [13] 
and MACE (v1.3.2) [8], combined by a csh script. PTTP+GLiDeS takes problems
in TPTP [15] format as input. The tptp2X utility is used to transform
the input problem to PTTP and MACE formats. The transformation to PTTP 
format selects the first conjecture clause as the top chain for the linear deduction. 

A perl script is used to remove the first conjecture clause from the MACE 
format file, and MACE is called to generate a model of the remaining clauses. 
MACE is capable of generating many models, but in this experiment the first 
model generated is used. If MACE is unable to generate a model then 
PTTP+GLiDeS terminates. Otherwise MACE outputs its model in the form 
of Prolog facts, e.g., 

aval(functor, a, 0).

aval(predicate, p(0,0), true).

The modified PTTP is then started.

The modified PTTP produces Prolog procedures that i) maintain a list of all
A-literals that have been produced in the deduction so far, and ii) call a seman- 
tic checking procedure after each extension and reduction operation. The facts 
produced by MACE are used to interpret the A-literals. If the semantic checking 
procedure finds an A-literal that is TRUE then the extension or reduction is 
rejected. 
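The semantic check can be sketched as below. The dictionaries are a hypothetical in-memory form of a finite interpretation, and only ground literals are handled; how the actual system treats A-literals containing variables is not described here.

```python
# Hypothetical interpretation over the domain {0, 1}.
FUNCTIONS = {("a", ()): 0, ("f", (0,)): 1, ("f", (1,)): 0}
PREDICATES = {("p", (0, 0)): True, ("p", (0, 1)): False,
              ("p", (1, 0)): False, ("p", (1, 1)): True}


def eval_term(term):
    """Map a ground term (functor, arg, ...) to a domain element."""
    functor, *args = term
    return FUNCTIONS[(functor, tuple(eval_term(a) for a in args))]


def literal_is_true(literal):
    """A literal is a triple (positive, predicate, [ground terms])."""
    positive, predicate, args = literal
    value = PREDICATES[(predicate, tuple(eval_term(a) for a in args))]
    return value if positive else not value


def semantic_check(a_literals):
    """Accept an inference only if every A-literal is FALSE in the model."""
    return not any(literal_is_true(lit) for lit in a_literals)
```

For example, with `a = ("a",)`, the A-literal p(a, a) is TRUE in this interpretation, so an inference that produces it would be rejected, while p(a, f(a)) is FALSE and would be accepted.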



5 Performance 



Testing has been carried out using 541 “difficult, unbiased” problems from the 
TPTP library v2.1.0. The testing was done on a SUN sparc20, with a CPU time 
limit of 600 seconds. Table 1 gives an overall summary of the results. 



                                        PTTP   PTTP+GLiDeS
Total number of problems:                541
Number of models generated:              260
Number of problems solved from 260:       68            54
Number of useful models generated:       144
Number of problems solved from 144:       21            19



Table 1. Summary of experimental data. 



For PTTP+GLiDeS, MACE produced models for only 260 of the 541 problems,
and thus PTTP+GLiDeS could attempt only those problems. Of the
260 problems for which models were generated, plain PTTP solved 68 and 
PTTP+GLiDeS solved 54. Altogether, there were 69 problems that had models 
generated and were solved by either system. Of the 260 models, only 144 proved 
to be useful in that they provided guidance that pruned the search space of 
PTTP+GLiDeS. Of these 144 problems, PTTP solved 21 and PTTP+GLiDeS 
solved 19. In total, there were 22 problems that had useful models generated 
and were solved by either system. 
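The overlap between the two systems follows from these counts by inclusion-exclusion. The short sketch below uses only the figures quoted above; the function name is illustrative.

```python
def overlap(solved_a, solved_b, solved_either):
    """Return (solved by both, solved only by A, solved only by B)."""
    both = solved_a + solved_b - solved_either
    return both, solved_a - both, solved_b - both


# 260 problems with models: PTTP 68, PTTP+GLiDeS 54, either system 69.
print(overlap(68, 54, 69))
# 144 problems with useful models: PTTP 21, PTTP+GLiDeS 19, either system 22.
print(overlap(21, 19, 22))
```

So among the 260 problems, 53 were solved by both systems, 15 by PTTP alone and 1 by PTTP+GLiDeS alone; among the 144 problems with useful models, the corresponding counts are 18, 3 and 1.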

For the 22 problems solved, Table 2 shows the CPU times taken, the number
of inferences made, and the number of inferences rejected during the search.
The “CPU time” column for PTTP+GLiDeS includes the time taken for preprocessing
the MACE input file to exclude the chosen top clause, leaving only
the model clauses, model generation and output time, writing of the Prolog
procedures, and the Prolog search time. For PTTP, the CPU time includes the time
for writing the Prolog procedures, and the Prolog search time. The “Inferences”
columns give the total number of extension and reduction operations performed
during the search for a solution. The “Rejected Inferences” are the numbers of
inference operations that were rejected by the semantic pruning routine. The
“Inference Ratio” shows the number of inferences made by PTTP+GLiDeS relative
to PTTP.

The number of inferences made during the search gives an indication of the 
search space being covered during the search. A smaller inference count on the 
same problem does not necessarily indicate that the proof itself was any smaller. 
Instead, it shows that less of the search space was covered before the proof was 
found. 
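As a small worked example of the inference ratio, the sketch below recomputes two of the reported ratios from the inference counts listed for SET008-1 and RNG041-1. The function name is illustrative.

```python
def inference_ratio(glides_inferences, pttp_inferences):
    """Inferences made by PTTP+GLiDeS relative to plain PTTP."""
    return glides_inferences / pttp_inferences


# SET008-1: 370 inferences with guidance, 276 without.
print(round(inference_ratio(370, 276), 2))
# RNG041-1: 4859 inferences with guidance, 8826 without.
print(round(inference_ratio(4859, 8826), 2))
```

A ratio below 1 means that less of the search space was covered with semantic guidance; a ratio above 1, as for SET008-1, means that more was covered.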



Table 2. Results for problems where semantic guidance rejected some inferences

                  PTTP                     PTTP+GLiDeS
Problem    CPU time  Inferences  CPU time  Inferences    Rejected  Inference
              (sec)                 (sec)              Inferences      Ratio
BOO004-1       11.0       10515      19.5        9355         365       0.89
BOO012-1      392.8     1579178   TIMEOUT
CAT001-4      108.5      427522                             37716       0.89
CAT002-4       23.9       84480      87.6
CAT003-3    TIMEOUT                 230.2      217996       34840
CAT003-4        8.6       11077      16.7                     585
CAT012-3       84.5      175367      49.1       49150        4124       0.28
CAT018-1       73.0      226900     343.2      183518
GRP012-3      362.8     1282139   TIMEOUT
HEN003-3       17.2       47136                 44322                   0.94
HEN008-1       17.7                  65.9       69959                   0.97
HEN008-3        7.7       11524      17.7                               0.95
HEN012-3       25.2       85312                              1857       0.94
PUZ032-1       15.4       26947      19.8       18629        4427       0.69
RNG002-1       13.7       27867      36.9       27313         756       0.98
RNG003-1       14.3                             23867        1412
RNG040-1        9.0         563      11.9         533          67       0.95
RNG041-1       11.1        8826      16.8        4859         824       0.55
ROB016-1                   4546      22.8        3738          92       0.82
SET008-1        7.4         276       9.2         370          56       1.34
SYN071-1      411.8      832600      53.3       84908       27653       0.10
SYN310-1                1476442   TIMEOUT
Average       87.25   254354.80     91.54    71545.11     7868.83       0.82




PTTP+GLiDeS solved one problem, CAT003-3, that PTTP failed to solve,
but timed out on three that PTTP did solve. In all but one case PTTP+GLiDeS
made fewer inferences than PTTP, and in many cases significantly fewer. The times
taken by PTTP+GLiDeS are higher than for PTTP in most cases. Two interesting
cases to note are CAT012-3 and SYN071-1. These are non-Horn problems,
and have the best reduction in inference counts and less CPU time than PTTP.
Of the 22 problems, 7 are non-Horn and it is in these cases that PTTP+GLiDeS
performs best on average for CPU time and inference counts, as shown in Table 3.

Table 3. Results for non-Horn problems

                  PTTP                     PTTP+GLiDeS
Problem    CPU time  Inferences  CPU time  Inferences    Rejected  Inference
              (sec)                 (sec)              Inferences      Ratio
CAT003-3    TIMEOUT                 230.2      217996       34840
CAT012-3       84.5      175367      49.1       49150        4124       0.28
PUZ032-1       15.4       26947      19.8       18629        4427       0.69
RNG040-1        9.0         563      11.9         533          67       0.95
RNG041-1       11.1        8826      16.8        4859         824       0.55
SET008-1        7.4         276       9.2         370          56       1.34
SYN071-1      411.8      832600      53.3       84908       27653       0.10
Average       89.87      174097     55.76       53778       10284       0.65




There were 15 problems out of the group of 260 problems that were solved by 
PTTP and not PTTP+GLiDeS. Of these, 12 were Horn problems that had mod- 
els generated where the positive literals were TRUE, i.e., the models were trivial. 
PTTP+GLiDeS performed badly on these 12 problems as semantic checking was 
done when it was not going to have any effect on the search for a solution other 
than to slow its progress. It is a simple matter to check for this situation and 
omit the semantic guidance. Future implementations of PTTP+GLiDeS will 
have this feature. Of course, a better solution is to not generate trivial models 
in the first place. MACE is capable of producing many models for a given set 
of model clauses. In this experiment, the first model generated was used for the 
interpretation. A better approach may be to generate more than one model and 
select the ‘best’ one for use, or at least select a non-trivial model, as discussed 
in Section 7. 



6 Using Multiple Models 

Work is currently under way on two different multiple model versions of 
PTTP+GLiDeS. Version 1 generates several different models for a problem and 
performs the semantic checking using all models, i.e., the A-literals must be acceptable
to all models before an inference operation is accepted. By using more
than one model it is hoped that greater pruning will be achieved. Preliminary 
testing has shown that while some extra pruning is achieved, the time taken 
to perform the semantic checking is greatly increased. For this approach to be 
practical a much more efficient implementation of the semantic checking routine 
needs to be written. 

Version 2 runs PTTP+GLiDeS in parallel with different models; the first 
one to find a solution kills the others. It has been observed that in some cases 
one model results in a timeout and another produces a solution for the same 
problem. By running in parallel with different models, it is hoped that one of
the models will be a ‘good’ model and produce a solution. It may also assist in
overcoming any incompleteness problems.

Table 4 shows data for PTTP+GLiDeS using the different multiple model 
versions and the data for PTTP. The models were hand coded rather than generated
by MACE. The parallel approach of version 2 was simulated: PTTP+GLiDeS
was run with each of the eight different models, then the best
time was selected and multiplied by eight. The number of inferences was also 
multiplied by eight, but the rejected inferences count was not. 
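The simulation of the parallel version amounts to the following bookkeeping. This is a sketch of the costing rule described above, not the script actually used; the per-model run data in the usage example are hypothetical.

```python
def simulate_parallel(runs):
    """Cost a simulated parallel run over several models.

    runs holds one entry per model: (cpu_seconds, inferences, rejected),
    or None if that model led to a timeout.
    """
    finished = [run for run in runs if run is not None]
    if not finished:
        return None  # every model timed out
    cpu, inferences, rejected = min(finished, key=lambda run: run[0])
    k = len(runs)
    # CPU time and inferences are charged for all k processes; the rejected
    # inference count is reported for the winning model only.
    return cpu * k, inferences * k, rejected


# Hypothetical results for eight models on one problem.
runs = [(60.2, 1500, 9), None, (46.8, 1323, 7), (51.0, 1400, 8),
        None, (70.5, 1610, 3), (49.9, 1350, 7), (55.3, 1420, 6)]
print(simulate_parallel(runs))
```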



Table 4. Results for PTTP and multiple model versions of PTTP+GLiDeS

                 PTTP           8 Models, Version 1       8 Models, Version 2
Problem       CPU   Infer-      CPU   Infer-   Rej.       CPU   Infer-   Rej.
             time   ences      time   ences    Inf.      time   ences    Inf.
            (sec)             (sec)                     (sec)
LCL007-1                        4.3       3      0       34.4              0
LCL010-1                      672.4    1321     11      374.4   10584      7
LCL118-1                      532.9    1392     64      327.2   11136     64




Version 1 can produce greater pruning than using a single model. However, 
this is at the cost of greatly increased CPU time, greater than for version 2 where
the best CPU time was multiplied by 8. 

Version 2 is of no benefit when all models result in solutions being found, as 
in the problems shown in Table 4. Its usefulness is more apparent in cases where 
one of the models does not produce a solution but another does. In such a case, 
version 1 would timeout but version 2 would not. 

7 Conclusion 

The preliminary experiments are encouraging. In the cases where both PTTP
and PTTP+GLiDeS find a solution, PTTP+GLiDeS makes fewer inferences
on average. This indicates that the semantic guidance is successfully pruning
the search space. A side effect of the pruning may be a loss of refutation
completeness. Further work needs to be done to assess the extent of this. The time
taken by PTTP+GLiDeS is greater than PTTP in the majority of cases where
both systems find a solution. It may be possible to improve the time taken by
PTTP+GLiDeS by making the semantic checking code more efficient.

Currently the tptp2X utility chooses the first conjecture clause in the problem
as the top centre chain for PTTP. The failure to generate models in some cases
is due to this unintelligent selection of the top chain. The performance of both
PTTP and PTTP+GLiDeS is likely to be improved if this selection is done
more intelligently. In particular, it is hoped that MACE will be able to produce
models for many more problems, hence giving PTTP+GLiDeS an opportunity
to attempt more problems.

Trivial model generation is another problem that needs to be overcome. It 
is possible to examine the model generated by MACE and determine if it is 
unlikely to be of use, as in the case of a trivial model, and reject it. It would 
be preferable to prevent generation of such a model in the first place. Further 
examination of this issue is needed. 

Using multiple models should enable PTTP+GLiDeS to achieve greater 
pruning of the search space. Preliminary results show that this is the case but 
the increase in CPU time is currently too high.


Abstract. Backpropagation (BP) is one of the most frequently used practical
methods for supervised training of artificial neural networks. During the
learning process, BP may get stuck in local minima, producing a suboptimal
solution and thus limiting the effectiveness of the training. This work is
dedicated to the problem of avoiding local minima and introduces a new
technique for learning, which replaces the gradient descent algorithm in BP
with an optimization method for a global search in a multi-dimensional
parameter (weight) space. For this purpose, a low-discrepancy LPτ sequence
is used. The proposed method is discussed and tested with common
benchmark problems.

Keywords - Neural networks, NN learning. 



1. Introduction 

Backpropagation [14] is still one of the most widely used methods for supervised
training of neural networks by minimizing an error function (cost) that represents
the difference between the desired and obtained output, with respect to the weights of
the network. Because the different techniques (gradient descent, line search,
conjugate gradients, Newton's and quasi-Newton's) used in BP depend strongly on
the initial conditions [2, 9, 12], there is always a danger of getting stuck in local
minima during the learning (theoretically even in a saddle point [1]) and thus
obtaining sub-optimal training [2, 8, 19]. Bianchini et al. [2] reviewed some
theoretical contributions to optimal learning, focusing on the problem of local
minima. Another useful survey of the problems of supervised and unsupervised
learning is given in [1].

One conventional approach is to consider the learning as an error function 
minimization problem. For this reason, investigation of the structure and landscape 
of the error surface is a related topic of continuous interest. Its detailed analysis and 
study [7, 11, 18] could reveal features and properties of the error function behavior, 
thus giving orientation where and how to search for a global minimum, and helping 
to define local minima free conditions [2, 8, 10, 17, and 19]. 

N. Foo (Ed.): AI'99, LNAI 1747, pp. 255 - 267, 1999.

In summary, the general view is that convergence of the BP learning algorithms
is guaranteed only in the vicinity of a minimum, and that during the
iterative process of error function minimization the algorithm can become trapped
in a wrong extremum, thus providing suboptimal learning.

In this work, a method for modified BP supervised learning is reported, in which
the BP backward pass (adjusting the weights) is replaced with a technique for
global search in the weight space, using the LPτ low-discrepancy sequence of points.



2. LPτ Search

We use an efficient technique for searching in a multi-dimensional bounded space,
which uses the LPτ low-discrepancy sequence of points. A comprehensive description
of this method can be found in [16]. Here we describe it very briefly.



2.1 Uniformly Distributed Sequence of Points in an Arbitrary Domain G 



Let $P_1, P_2, \ldots, P_i, \ldots$ be a sequence of points belonging to a unit n-dimensional cube
$C^n$. If G denotes an arbitrary domain in $C^n$ with volume $V_G > 0$, and $S_N(G)$
denotes the number of points $P_i$, $1 \le i \le N$, $P_i \in G$, then the sequence of points $P_i$ is
said to be uniformly distributed in $C^n$ if

$$\lim_{N \to \infty} \frac{S_N(G)}{N} = V_G. \quad (1)$$

It is obvious from (1) that when N is large enough, the number of points of a given
sequence from an arbitrary domain G is proportional to its volume, $S_N(G) \sim N V_G$.
Fig. 1 shows two different uniformly distributed sequences [16].



2.2 LPτ Net in a Multi-Dimensional Unit Cube

Any point $P_i$ from an n-dimensional unit cube $C^n$ (Fig. 1, n=2) has Cartesian
co-ordinates $P_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ that satisfy the inequalities $0 \le x_{ij} \le 1$,
for $j = 1, 2, \ldots, n$ and $i = 1, 2, \ldots, N$. It is usually assumed that a cubic net of
$N = M^n$ points (Fig. 1(a)) best represents such a cube. Nevertheless, this is true only
in the linear case, when n=1, and with the increase of n its "uniformity" decreases
quickly. To see this, let us compare the 2D nets from Fig. 1. They both consist of
16 points and each of the 16 small squares contains one and only one point of a net. It
seems that both nets have nearly the same uniform distribution. However, this
impression changes if we consider a function $f(x_1, x_2)$, defined in $C^2$, which
is extremely sensitive to small changes in one of the variables ($x_1$), while large
changes in the other one do not cause a significant change in the function value. In
the extreme, if we assume $f = f(x_1)$ and calculate the function for all points from
the cubic net (Fig. 1(a)), we would obtain only four different values, each of them
repeated four times. If we do the same with the points from the LPτ net (Fig. 1(b)),
we would obtain 16 values that give a much better description of the function
behavior in that domain. In the multi-dimensional case (which is more common, when
a function $f(x_1, x_2, \ldots, x_n)$ depends strongly on some variables and weakly on the
rest), the distribution of the cubic net becomes worse, because the lost information
increases.
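The comparison of the two nets can be reproduced with a short sketch. A 2D Halton sequence is used here as a stand-in low-discrepancy net, because the direction constants of the LPτ sequence are tabulated in [16] and not reproduced in this paper; the function names are illustrative.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def cubic_net(m):
    """m x m grid with one point at the centre of each small square."""
    return [((i + 0.5) / m, (j + 0.5) / m) for i in range(m) for j in range(m)]


def halton_net(n):
    """First n points of the 2D Halton sequence (bases 2 and 3)."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3)) for i in range(1, n + 1)]


def distinct_values(points, f):
    """Number of different function values observed on the net."""
    return len({f(x1, x2) for x1, x2 in points})


def depends_on_x1_only(x1, x2):
    return x1


print(distinct_values(cubic_net(4), depends_on_x1_only))
print(distinct_values(halton_net(16), depends_on_x1_only))
```

With 16 points each, the cubic net samples the function at only 4 distinct values of $x_1$, while the low-discrepancy net samples it at 16.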

Now, if we go back for a moment to the neural networks, the cost is a function of
the weights and it is well known that there is redundant information contained in
the weights of a fully connected network [7, 11]. Different pruning algorithms are
used for eliminating the "weak" weights, to which the cost is less sensitive. As will
be seen later, the use of the LPτ sequence for searching the error surface could be
very effective in cases where we do not know a priori how many of the parameters
(weights) act strongly on the optimized function (cost), and how many act weakly.




(a) Cubic net (b) Better (LPτ) net

Fig. 1. Two nets (n=2, M=4, N=16) 



In the next section we bound the weight space by a hyper parallelepiped $\Pi$. In order
to transfer the co-ordinates of the uniformly distributed points in $C^n$ to those in $\Pi$,
we give the following two lemmas (their proofs can be found in [16]).

Lemma 1: If a sequence of points $Q_i$ with Cartesian co-ordinates $(q_{i,1}, \ldots, q_{i,n})$ is
uniformly distributed in $C^n$, then a sequence of points $W_i$, $1 \le i \le N$, with Cartesian
co-ordinates $(w_{i,1}, \ldots, w_{i,n})$, where

$$w_{i,j} = w'_j + (w''_j - w'_j)\, q_{i,j}, \quad j = 1, 2, \ldots, n \quad (2)$$

is a uniformly distributed sequence in the parallelepiped $\Pi$, whose points have
Cartesian co-ordinates that satisfy the inequalities

$$w'_j \le w_{i,j} \le w''_j, \quad (3)$$

where, as in (2), the superscripts ' and '' denote lower and upper bounds respectively.

Lemma 2: Let $W_1, \ldots, W_i, \ldots$ be a sequence of points uniformly distributed in $\Pi$, and
$G \subset \Pi$ be an arbitrary domain with volume $V_G > 0$. If among the points $W_i$ we
choose those belonging to G, we obtain a sequence of points uniformly distributed
in G.
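The transformation (2) is a per-coordinate affine map from the unit cube to the parallelepiped. A minimal sketch, with an illustrative function name:

```python
def to_parallelepiped(q, lower, upper):
    """Map a point q of the unit cube to the box given by lower and upper
    bounds, coordinate by coordinate, as in formula (2)."""
    return [lo + (hi - lo) * qj for qj, lo, hi in zip(q, lower, upper)]


# A point of the unit cube mapped into the box [-20, 20]^3.
print(to_parallelepiped([0.0, 0.5, 1.0], [-20.0] * 3, [20.0] * 3))
```

Because each coordinate satisfies $0 \le q_{i,j} \le 1$, the mapped coordinate always lies between its lower and upper bound, which is inequality (3).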

The LPτ sequence of points is one of the best low-discrepancy sequences known
so far (for example, compared with an r-nary LPg sequence, a (r,s)-sequence in
base b [4], or a sequence [16]). Below we give three examples of effective use of
this sequence when searching for a global minimum (GM) of difficult-to-optimize
functions (with many local minima (LM), plateaux and very steep regions).

Usually, an optimization problem is defined as the minimization (maximization) of
objective function(s) $F(w)$ with respect to a vector $w$ whose components are n
unknown design variables, or parameters $w_j$, $1 \le j \le n$. The parameters are usually
bounded by physical realizability criteria, so that the optimization is subject to
constraints (3). These bounds determine an n-dimensional hyper parallelepiped $\Pi^n$ in
the parameter space. In this parallelepiped we generate an LPτ sequence of points
$P_1, P_2, \ldots, P_N$ using formulae (2) and (3) (for the calculation of (2), so-called direction
constants given in a table from [16] are used), and from the values
$F(P_1), F(P_2), \ldots, F(P_N)$ find the minimum one

$$F(P^*) = \min_{1 \le i \le N} F(P_i). \quad (4)$$

We assume that $F(P^*) \approx \min F(w)$, $P^* \approx w^*$, where

$$F(w^*) = \min_{w \in \Pi^n} F(w), \quad (5)$$

and $F(w)$ is defined and continuous in $\Pi^n$. The convergence of this search, i.e., that at
least one point of the sequence will fall in a “sufficiently small vicinity” of a GM,
is proved in [16]. In fact, it is proved that when $N \to \infty$, the number of testing points
from the vicinity of a global minimum is much greater.

Below, we list three functions that we used to test the LPτ search for finding
a GM.

Function 1. Two-dimensional Waves test function (many LM and one GM (5, 2)):
$f(x,y) = 1 - \sin(z)/z$, $z = \sqrt{(x-5)^2 + (y-2)^2 + \varepsilon}$.

Function 2. Two-dimensional Shubert test function (760 LM and 18 GM):
$f(x,y) = \sum_{i=1}^{5} i \cos[(i+1)x + i] \cdot \sum_{i=1}^{5} i \cos[(i+1)y + i]$.

Function 3. Two-dimensional Rastrigin test function (50 LM and 1 GM (0, 0)):
$f(x,y) = x^2 + y^2 - \cos(18x) - \cos(18y)$.

Table 1 gives the relevant parameters of the test searches for a GM of the functions
listed above (with 2^* LPτ points for all intervals). The optimization
was done in several passes, which means starting with initially “wide” search
intervals (3), and subsequently narrowing the intervals before each successive pass.
The bottom three rows of Table 1 show three search intervals and the co-ordinates of three
successive points found in the vicinity of a GM for the third function. For the other
two functions just the result of the last pass is given. The search interval (third
column) is the same for both parameters. The GM values obtained for each function
show very good agreement with the results given in [5], where the same
functions are used to test the subenergy tunneling method (TRUST) for fast global
optimization.

Table 1. GM found for the test functions

Function     GM found               Interval
Waves        (5.0008, 2.0005)       [0.0, 10.0]
Shubert      (-7.7069, -7.0841)     [-10, 0]
Rastrigin    (0.01525, 0.01525)     [-500, 1000]
             (0.00152, 0.00152)     [-50, 100]
             (0.00015, 0.00015)     [-5, 10]
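The multi-pass search with narrowing intervals can be sketched as follows. Several assumptions are made for illustration: a 2D Halton sequence stands in for the LPτ sequence, each pass shrinks the interval to a fixed fraction of its width around the best point found so far, and the parameter names `n_points`, `passes` and `shrink` are hypothetical rather than taken from the paper.

```python
import math


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def rastrigin(x, y):
    """Function 3 of the test set."""
    return x * x + y * y - math.cos(18 * x) - math.cos(18 * y)


def multipass_search(f, lower, upper, n_points=4096, passes=3, shrink=0.1):
    """Search a square [lower, upper]^2, narrowing the intervals after each pass."""
    lo, hi = [lower, lower], [upper, upper]
    best_point, best_value, history = None, math.inf, []
    for _ in range(passes):
        for i in range(1, n_points + 1):
            q = (radical_inverse(i, 2), radical_inverse(i, 3))
            point = tuple(l + (h - l) * qj for qj, l, h in zip(q, lo, hi))
            value = f(*point)
            if value < best_value:
                best_point, best_value = point, value
        history.append(best_value)
        # Narrow each interval around the best point, staying inside the
        # original bounds.
        half = [shrink * (h - l) / 2 for l, h in zip(lo, hi)]
        lo = [max(lower, c - d) for c, d in zip(best_point, half)]
        hi = [min(upper, c + d) for c, d in zip(best_point, half)]
    return best_point, best_value, history


point, value, history = multipass_search(rastrigin, -500.0, 1000.0)
print(point, value)
```

Because the best point is carried over from pass to pass, the recorded best value can only stay the same or decrease; whether the narrowed interval still contains the global minimum depends on the shrink factor and the number of points, which is why the choice of intervals is left to the user in the procedure described above.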



3. Network Learning 

Despite the variety of gradient descent methods used in BP learning, it still holds 
two major weaknesses - slow convergence and presence of local minima. In this 
paper we concentrate mainly on the problem of avoiding local minima. One 
conventional approach, adopted in [16], is to consider learning as a surface-fitting 
problem. We consider learning as a non-linear optimization problem in which the 
goal is to find a set of values for the network weights which minimizes an objective 
function that represents the deviation between the actual and desired network output. 

The network under consideration is a fully connected (between adjacent layers)
static network, consisting of L layers denoted with index l, $0 \le l \le L$, where for the
input layer l=0, for the hidden layers 0<l<L, and for the output layer l=L. Each layer
of the network consists of units whose number is denoted by $n_l$, and each unit in layer
l is indexed with i, $i = 1, 2, \ldots, n_l$.

We assume $x_p^0$ and $t_p$, $p = 1, 2, \ldots, P$, to be the input and target patterns
respectively. For a given pattern p, the activation of a unit i from the l-th layer is

$$a_i^l = g(w_i^l, x^{l-1}) = \sum_{j=0}^{n_{l-1}} w_{ij}^l\, x_j^{l-1}, \quad (6)$$

where $w_{ij}$ is the weight associated with the connection between unit j from the
(l-1)-th layer and unit i from the l-th layer. The bias is defined by $x_0^{l-1} = 1$ with
corresponding bias weight $w_{i0}$. The output of unit i from the l-th layer is
related to its activation as $x_i = f(a_i)$, where $f(\cdot)$ is a standard sigmoidal non-linear
function. It has a property that is useful for BP:

$$f'(\cdot) = f(\cdot)[1 - f(\cdot)], \quad f(a_i) = (1 + e^{-\beta a_i})^{-1}, \quad (7)$$

where $\beta$ is the gain (in our case $\beta = 1$).

For a given experiment with P learning samples, the difference between the
produced and desired output data is estimated by means of a cost function
$E = \sum_{p=1}^{P} E_p = \sum_{p=1}^{P} d(x_p^L - t_p)$, where $d(\cdot)$ is a distance in $\mathbb{R}^{n_L}$. Most frequently this
distance is an $L_2$ norm, which gives

$$E = \frac{1}{2} \sum_{p=1}^{P} \| x_p^L - t_p \|^2 = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{n_L} (x_{p,i}^L - t_{p,i})^2. \quad (8)$$


The choice of the cost function is important and as pointed out in [2], different 
choices might lead to optimization problems with different minima. It could also 
give rise to spurious and structural local minima [17]. 
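The forward pass and the cost can be sketched as below. The nested-list weight layout, with the bias weight stored first for each unit, is an assumption made for this illustration, and the cost is taken in the usual half sum-of-squares form.

```python
import math


def sigmoid(a, gain=1.0):
    """Standard sigmoid with gain, as in (7)."""
    return 1.0 / (1.0 + math.exp(-gain * a))


def forward(weights, x):
    """Propagate an input through the network.

    weights[l][i] is the list [w_i0 (bias), w_i1, ..., w_in] for unit i of
    layer l + 1; the bias input is fixed at 1 as in (6).
    """
    for layer in weights:
        x_with_bias = [1.0] + list(x)
        x = [sigmoid(sum(w * xj for w, xj in zip(unit, x_with_bias)))
             for unit in layer]
    return x


def cost(weights, patterns):
    """Half the summed squared output error over all (input, target) pairs."""
    return 0.5 * sum((o - t) ** 2
                     for x, target in patterns
                     for o, t in zip(forward(weights, x), target))


# A 2-2-1 network with all weights zero outputs 0.5 for every input.
zero_weights = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
xor_patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
print(forward(zero_weights, [0, 1]), cost(zero_weights, xor_patterns))
```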

Usually, the weight adjustment (in pattern mode) is given by the well-known
gradient descent technique

$$\Delta w_{ij}(k+1) = -\eta \frac{\partial E_p}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(k), \quad w_{ij}(k+1) = w_{ij}(k) + \Delta w_{ij}(k+1), \quad (9)$$

where $\eta$ and $\alpha$ are non-negative parameters called learning rate and momentum,
and the iteration index is denoted with k.
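A single update step of the gradient descent rule with momentum can be written as follows. The function name and the flat-list representation of the weights are illustrative; the default values 0.1 and 0.9 are the learning rate and momentum used for the standard BP runs in Section 4.

```python
def momentum_step(weights, gradient, previous_delta, eta=0.1, alpha=0.9):
    """One weight update: delta(k+1) = -eta * gradient + alpha * delta(k)."""
    delta = [-eta * g + alpha * d for g, d in zip(gradient, previous_delta)]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta


# One step for a single weight with gradient 2.0 and previous change 0.5.
print(momentum_step([1.0], [2.0], [0.5]))
```

The returned `delta` has to be passed back in as `previous_delta` on the next iteration, which is what carries the momentum term from step k to step k + 1.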

BP minimizes the cost (error energy) function (8) by adjusting the weights
with (9). The objective cost function represents the network mapping errors and
describes an error surface in the weight space. In many cases this surface can be
highly convoluted and nonconvex, with many plateaux and long narrow troughs, and
can contain many saddle points and LM [11]. This can make learning very
difficult and can cause BP to fail to find an optimal solution in reasonable time.

With our method of learning, while keeping the forward pass of the BP, we
modify the backward pass, substituting gradient descent with the LPτ search for a
global minimum of the cost function (8). Initially, we create the network weight
vector as a q-dimensional real Euclidean vector W, whose components consist of all
weights of the network,

$$q = \sum_{l=1}^{L} (n_{l-1} + 1)\, n_l. \quad (10)$$

The matrix of weights associated with the l-th layer, $0 < l \le L$, we
represent with a layer vector by concatenating its rows. Then, concatenating all layer
vectors, we obtain the network weight vector W. This vector is defined in the
weight space $\Omega^q$, in which we define a q-dimensional hyper parallelepiped
$\Pi^q$ ($\Pi^q \subset \Omega^q$) with the constraints (3). The objective is to find a point $W^*$ from $\Pi^q$
which minimizes the error function $E(W)$,

$$E(W^*) = \min_{W \in \Pi^q} E(W). \quad (11)$$

The weight space $\Pi^q$ is assumed compact and the cost function $E(W)$
continuous in it, which guarantees that its values are bounded.
Clearly, the network weight vector determines the input-output transfer
function of the network as stated in [7].

The proposed algorithm can be summarized as follows:

1. Define the number of elements q (10) of the network weight vector W and the
weight limit intervals (3) that define $\Pi^q$;

2. Give the number of testing points N for the LPτ low-discrepancy sequence;

3. Using the LPτ sequence technique, calculate a point from $\Pi^q$ which defines the
designed network weight vector $W^*$;

4. Propagate the network using formulae (6) and (7);

5. Calculate the error function (8);

6. Repeat steps 3-5 and save the better of the two points (if $E(W^*) > E(W_{new})$,
then $W^* = W_{new}$);

7. Exit if $E(W^*) < \varepsilon$, where $\varepsilon$ is an initially given “sufficiently small” value, or if the
number of iterations is greater than N; if not, continue to repeat steps 3-7;

8. If after exiting step 7 the condition $E(W^*) < \varepsilon$ is still not satisfied, then change the
intervals (3) considering the values of the recent best point $W^*$, $W^* \in \Pi^q$, and
if necessary change the number of testing points N (usually increasing it).

If there exists an exact solution of the mapping problem, such that $E(W) = 0$ (the input
patterns can be successfully separated), then this is the ideal situation.
However, the more common case is when such a solution does not exist, and then we
consider a global minimum $E(W^*) > 0$ of $E(W)$. We accept it as the optimal
solution of the problem, and the whole process as optimal training of the network. It
is also possible that after several executions of step (8), the condition $E(W^*) < \varepsilon$
is still not satisfied. In this case, either the best point ($W^*$) can be taken as the
optimal one, or alternatively, it may be necessary to redefine the problem (for
example, changing the dimensionality q, respectively changing the number of hidden
units of the network, etc.), and to repeat the whole procedure.
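The steps above can be sketched for a 2-2-1 network on the XOR patterns. This is an illustration under stated assumptions rather than the authors' implementation: a 9-dimensional Halton sequence stands in for the LPτ sequence, the error is the worst squared error over the four patterns, and the interval update of step 8 simply halves each interval around the current best point. The names `lp_search_training`, `n_points`, `passes` and `shrink` are hypothetical.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23]  # one Halton base per weight


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def xor_error(w):
    """Worst squared output error of a 2-2-1 network over the XOR patterns."""
    worst = 0.0
    for x1, x2, target in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]:
        h1 = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
        h2 = sigmoid(w[3] + w[4] * x1 + w[5] * x2)
        out = sigmoid(w[6] + w[7] * h1 + w[8] * h2)
        worst = max(worst, (out - target) ** 2)
    return worst


def lp_search_training(error, q, bound=20.0, n_points=2000, eps=1e-3,
                       passes=3, shrink=0.5):
    lo, hi = [-bound] * q, [bound] * q           # step 1: initial intervals
    best_w, best_e = None, math.inf
    for _ in range(passes):                      # step 8: new intervals
        for i in range(1, n_points + 1):         # steps 3-7
            w = [l + (h - l) * radical_inverse(i, PRIMES[j])
                 for j, (l, h) in enumerate(zip(lo, hi))]
            e = error(w)                         # steps 4-5
            if e < best_e:                       # step 6: keep the better point
                best_w, best_e = w, e
            if best_e < eps:                     # step 7: early exit
                return best_w, best_e
        half = [shrink * (h - l) / 2 for l, h in zip(lo, hi)]
        lo = [max(-bound, c - d) for c, d in zip(best_w, half)]
        hi = [min(bound, c + d) for c, d in zip(best_w, half)]
    return best_w, best_e


weights, error_value = lp_search_training(xor_error, q=9)
print(error_value)
```

No claim is made that this sketch reaches the error levels reported in the paper; the result depends on the sequence, the number of points and the way the intervals are narrowed.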



4. Computer Simulated Tests and Discussion 

In this section, we present network training with the proposed technique on 
benchmark test problems and compare the obtained results with those of the standard 
BP. In all examples, activation function (7) is used and the ANNs considered include
biases. For the third problem, an ANN with continuous output is also employed. For
comparison, a standard BP with a learning rate of 0.1, a momentum of 0.9, and
initial weights in the interval (-0.5, 0.5) was performed for all cases.

4.1 Classification of the XOR Problem

Many authors use XOR as a benchmark problem when investigating error surfaces 
with local minima free conditions [3, 6, 10, 13, 15, and 18]. We also chose this
problem and employed an appropriate neural network with minimal architecture,
consisting of two inputs, two units in the hidden layer and one output. Such a network
contains 9 connection weights and defines a 9-dimensional (q=9) learning problem.
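The dimension of the learning problem follows from the layer sizes, counting one bias weight per unit. A minimal sketch, with an illustrative function name:

```python
def weight_count(layer_sizes):
    """Number of weights, including biases, of a fully connected network."""
    return sum((n_prev + 1) * n
               for n_prev, n in zip(layer_sizes, layer_sizes[1:]))


# Two inputs, two hidden units, one output.
print(weight_count([2, 2, 1]))
```

For the XOR network this gives (2 + 1) * 2 + (2 + 1) * 1 = 9 weights.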

Table 2. Optimal weight vectors for standard BP and LPτ learning (XOR problem). 

 j    w* (BP)    w* (LPτ)      j    w* (BP)    w* (LPτ)
 1     2.9467    -5.1365       6    -5.6204     8.9520
 2    -6.9569     9.7997       7    -5.8389    -5.2562
 3    -6.9556    -7.7755       8   -12.325      9.4496
 4     8.3925    -2.0289       9    12.146      9.4316
 5    -5.6207    -9.4904









We carried out the learning procedure proposed in the previous section in three 
passes, starting with intervals (13) (Fig. 2), with points from the LPτ 
sequence. For every point we calculate the cost for each pattern from the batch (P=4) 
using (8), and choose the worst error as the batch error for that point (12). Then, from 
all N such batch errors, we find the minimal one with (14), and consider W* as 
the global minimum point in weight space for the training: 

$$F_i = \max\{E_{i,p}(W) \mid W \in \Pi^q;\ p = 1, \ldots, 4\}, \tag{12}$$

$$\Pi^q = \{W_j \mid -20 \le W_j \le 20;\ j = 1, 2, \ldots, q\}, \tag{13}$$

$$E(W^*) = \min\{F_i;\ i = 1, 2, \ldots, N\}. \tag{14}$$
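The worst-case batch error (12) and the selection of the best point (14) can be illustrated with a small sketch. Activation function (7) and cost (8) are defined outside this section, so a logistic sigmoid and a squared error are assumed here; the weight ordering and the hand-picked candidate weights are likewise hypothetical and are not the values from Table 2.

```python
import math

XOR_PATTERNS = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
                ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def sigmoid(x):
    """Assumed activation; the paper's function (7) is not shown in this section."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    """2-2-1 network with biases.
    Assumed ordering: w = [w11, w12, b1, w21, w22, b2, v1, v2, c]."""
    h1 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    h2 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
    return sigmoid(w[6] * h1 + w[7] * h2 + w[8])

def pattern_error(w, x, target):
    """Assumed per-pattern cost (squared error); stands in for cost (8)."""
    return (target - forward(w, x)) ** 2

def batch_error(w, patterns=XOR_PATTERNS):
    """Eq. (12): the worst per-pattern error is the batch error of point w."""
    return max(pattern_error(w, x, t) for x, t in patterns)

def best_point(points, patterns=XOR_PATTERNS):
    """Eq. (14): the point with the minimal batch error among all tested points."""
    errors = [batch_error(w, patterns) for w in points]
    i = min(range(len(points)), key=errors.__getitem__)
    return points[i], errors[i]

zero_w = [0.0] * 9                                           # every output is 0.5
hand_w = [10.0, 10.0, -5.0, 10.0, 10.0, -15.0, 10.0, -10.0, -5.0]  # OR minus AND
w_star, e_star = best_point([zero_w, hand_w])
```

Taking the maximum over patterns means a point is only rated as good when every pattern is mapped well, which is stricter than a summed error.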



Fig. 2. Weight histograms of narrowing search intervals for the XOR problem (first, 
second, and third pass). [Figure not reproduced: one histogram per weight W1, ..., W9 
over the range -20 to 20.] 
In [6], 50% of the simulation runs with gradient descent are 
reported to get stuck in a LM, whereas their tunneling method (whose convergence is not 
guaranteed in the multidimensional case) managed to escape local minima. BP 
training with a much higher success rate of 91.3% after 10000 trials is reported in [15], 
but there a GM is considered reached when the summed squared error is less 
than 2.5% per pattern. In our case, we obtained 0.0047% error for BP learning 
and 0.013% error for LPτ learning. 

Table 3. Test with noisy input: output results for the BP local minimum (LM), the LPτ global 
minimum (GM), and the Global Descent (GD) minimum for the XOR problem. 

Pattern        Target    LM BP    GM LPτ    GD
0.15  0.08     0.0       0.030    0.010     [illegible]
0.92  0.12     1.0       0.995    0.976     [illegible]
0.09  0.91     1.0       0.995    0.984     [illegible]
0.93  0.95     0.0       0.010    0.016     [illegible]
0.50  0.50     0.0       0.995    0.014     [illegible]





We trained the network with an additional pattern (0.5, 0.5) with target 0 (the so-called 
XOR5 problem in [8]). The nine weights obtained with the two methods are given in 
Table 2. The BP weights define an LM solution that separates the XOR task but fails to 
separate the additional pattern of XOR5. The last row of Table 3 shows that 
the network trained with our method succeeded. BP is quicker, however: our 
method took 23 and 5 seconds for the last two passes, with N=2^14 and N=2^12 points, 
respectively (Table 4). Nevertheless, finding a global minimum is worthwhile even at the 
expense of more training time. The last column of Table 3 gives the results obtained by 
Cetin et al. with their TRUST technique, named Global Descent (GD) in [6]. 
Their results are worse than ours, and it should also be noted 
that they were obtained for the corresponding noise-free patterns from the first 
column. Table 4 shows the time required for different numbers of points N. 

Table 4. Time required for different numbers of search points (XOR problem). 

N [×1024]     1    2    4    8    16    32    64    128    256
Time [sec]    1    2    5    11   23    53    119   265    614



4.2 Classification of the K-Input Parity Problem 

When solving the K-input parity problem, the network must produce a one if the 
input contains an odd number of ones, and a zero otherwise. It is considered a good 
benchmark problem for evaluating network training methods, since the output is 
sensitive to every single input change. XOR is the special case (K=2) of the 
parity problem. We consider the problem with K=4. The network architecture is 
4-4-1 (four inputs, four hidden units, one output, and biases). This architecture 
contains 25 connection weights and thus defines a 25-D learning problem (q=25). 
There are P=16 input-target patterns in the training set. We performed the learning 
in a similar manner, with the only difference that instead of (12) we used 

$$F_i = \min\Big\{\tfrac{1}{2}\sum_{p=1}^{P} E_{i,p}(W) \;\Big|\; W \in \Pi^q;\ P = 16\Big\}. \tag{15}$$

Again, we carried out the proposed learning algorithm in several passes, starting 
with intervals (-15, 15) for all weights, with N = 2^[?] points from the LPτ sequence, 
and subsequently narrowing the intervals and decreasing the number of points. 
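For illustration, the training set of the parity problem and the summed batch error used in place of the worst-case error can be sketched as below. The function names are hypothetical, and the per-pattern errors passed to `summed_error` are assumed to be whatever cost (8) yields for each pattern.

```python
from itertools import product

def parity_patterns(k):
    """All 2**k binary inputs with target 1.0 for an odd number of ones, else 0.0."""
    return [(bits, float(sum(bits) % 2)) for bits in product((0, 1), repeat=k)]

def summed_error(per_pattern_errors):
    """Batch error as half the sum of the per-pattern errors (the form used in (15))."""
    return 0.5 * sum(per_pattern_errors)

patterns4 = parity_patterns(4)                       # P = 16 patterns for K = 4
xor_targets = [t for _, t in parity_patterns(2)]     # K = 2 reproduces XOR
```

Half of the 16 patterns have target 1, so a network that outputs a constant cannot reduce the error by favouring one class.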

Table 5. Optimal weight vector for the LPτ training (K-input parity problem). 

 j    w*_j       j    w*_j       j    w*_j       j    w*_j
 1     2.591     8     4.783    15    -6.785    22     2.973
 2    -0.196     9    -4.920    16     8.641    23    -9.997
 3    -0.644    10     4.958    17    -5.489    24   -11.25
 4     1.885    11     3.435    18    -5.194    25    12.581
 5    -0.921    12    -6.294    19     5.431
 6     2.612    13    -6.364    20    -5.086
 7     5.040    14     6.690    21     0.462







The 25 optimal weights obtained with LPτ learning are given in Table 5. With 
these weights, the training error (15) is 0.005% and the error for a test 
with noisy input is 0.021%. The corresponding values for BP are 0.14% and 0.78%. 

Table 6. Time required for different numbers of points (K-input parity problem). 

N [×1024]     1    2    4    8    16    32    64    128
Time [sec]    2    5    12   29   67    146   334   753



The time for BP in the above case was 4 sec. The time required for our training 
with different numbers of testing points is given in Table 6. Compared with the errors 
given in [6] for this case, ours are considerably smaller (they reported a global solution 
with 0.22% error for noise-free input). We were not able to compare the times, 
because the authors do not report them. Our optimal result does not depend on the 
initial conditions, whereas [13] reported multiple entrapments of BP in local minima (initial 
random weights in (-0.5, 0.5)) and a relatively low success rate (32%) after 30000 
iterations (initial random weights in (-1.5, 1.5)) for the same network architecture. 



4.3 Classification of Lenses Problem 

We obtained the database for this small benchmark problem of fitting contact lenses 
from the UCI repository of databases. The task is to decide whether to fit a patient with hard, 
soft or no contact lenses at all. The data set is composed of 24 instances, and each of 
them has four attributes. Three of the attributes (spectacle prescription, tear rate, 
and astigmatic) have binary values, and one (age) is ternary-valued. The data is 
distributed between three classes as follows: hard lenses - 4 instances (16.7%), soft 
lenses - 5 (20.8%), and neither - 15 (62.5%). We trained two different networks on 
this problem. One, producing binary output, has architecture 4-4-3 (one 
hidden layer with 4 units), with the three output classes coded as (1, 0, 0), (0, 1, 0), 
and (0, 0, 1), respectively. This architecture contains 35 connection weights and thus 
defines a 35-dimensional learning problem (q=35). The other network, with 
architecture 4-3-1 (three hidden layer units), produces continuous output with the 
three classes coded as 0.9 (hard), 0.5 (soft), and 0.1 (neither). This architecture 
defines a 19-dimensional learning problem (q=19). We formed two sets of data 
(one for training and one for testing) with 12 instances each, both having 
nearly the same class distribution. We used (15) again for optimal learning, starting 
with initial intervals (13) for all weights, with N = 2^[?] points from the LPτ sequence, 
and P=12 patterns. 
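The two output codings and the class distribution described above can be written down directly. In this sketch the dictionary names are illustrative, and decoding a continuous output by the nearest class code is an assumption, since the text does not state how the 4-3-1 output is mapped back to a class.

```python
CLASS_COUNTS = {"hard": 4, "soft": 5, "neither": 15}     # 24 instances in total

# Binary coding for the 4-4-3 network (one output unit per class).
ONE_HOT = {"hard": (1, 0, 0), "soft": (0, 1, 0), "neither": (0, 0, 1)}

# Continuous coding for the 4-3-1 network (single output unit).
CONTINUOUS = {"hard": 0.9, "soft": 0.5, "neither": 0.1}

def class_shares(counts):
    """Percentage of instances per class, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}

def decode_continuous(y):
    """Assumed decoding rule: pick the class whose code is closest to output y."""
    return min(CONTINUOUS, key=lambda name: abs(CONTINUOUS[name] - y))

shares = class_shares(CLASS_COUNTS)
```

With the nearest-code rule, the decision boundaries of the single-output network lie at 0.3 and 0.7.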

Table 7. Errors obtained (in %) for the two trained architectures (Lenses problem). 

Size     BP train    BP test    LPτ train    LPτ test
4-4-3    0.13%       17.5%      0.47%        0.92%
4-3-1    0.35%       3.10%      0.40%        0.42%



Using the proposed algorithm, we obtained the optimal solution with the errors (15) 
given in Table 7. As can be seen from Table 7, the errors obtained with BP are 
smaller for the training set but larger for the testing set, where it failed to separate 
three of the patterns. The time for BP was 3 and 2 seconds, respectively. 
The time required for our training with different numbers of testing points N is 
given in Table 8. 

Table 8. Training time required in seconds for the two architectures (Lenses problem). 

N [×1024]       1    2    4    8    16    32    64    128
Time (4-4-3)    4    7    17   41   94    215   488   1101
Time (4-3-1)    2    4    9    23   51    114   262   589



The computational time depends on the number of testing points and the number 
of patterns (instances). The number of testing points depends on the size of the 
parameter intervals (parameter space) and on the error threshold ε. The 
iterative narrowing of the intervals can reduce the number of search 
points, whereas decreasing the error threshold ε increases their 
number and hence the search time. Not all search points are always 
computed: if the minimum error (step 7) is reached, the training process stops. 
5 Conclusion 

The introduced method for modified feed-forward optimal learning has been presented with 
results on standard benchmark problems and shown to be an efficient and reasonably 
simple technique for small ANNs. It does not depend on the initial conditions 
and, compared with standard BP, it has the advantage of avoiding local minima and 
producing optimal learning. However, it is more time consuming, especially in the 
initial passes of the proposed procedure. As the dimensionality increases, the 
number of testing points should be increased, which in turn increases the 
computational load. Future work will concentrate on this problem. 
One of the advantages of our method is that it does not assume that the 
transfer functions have continuous and differentiable first and second derivatives, 
and it solves the original learning problem, finding a global minimum of the error 
function, rather than a system of derivative equations. 
Abstract. If voting is used by an ensemble to classify data, some data points 
may not be classified, but a higher proportion of those which are classified are 
classified correctly. This trade-off is affected by ensemble size and voting 
threshold. This paper investigates the effect of ensemble size on the proportions 
of decisions made and correct decisions. It does this for majority voting 
and consensus voting on ensembles of neural network classifiers constructed 
using bagging. It also models the relationships in order to estimate the 
asymptotic values as the ensemble size increases. 



1 Introduction 

Using the results of several classifiers is a technique which has been shown to give 
more accurate classification than a single classifier [1], [2], [3]. The resulting 
classifier is known as an ensemble. 

The most popular methods of constructing ensembles are bagging [4] and boosting 
[5]. Both methods generate multiple classifiers by resampling the training data. 
Bagging (bootstrap aggregating) trains the component classifiers using independent 
samples drawn with replacement from the training data. Boosting creates a 
succession of classifiers by giving greater weight to data points misclassified by 
previous classifiers. 

In an ensemble constructed by bagging, the ensemble may classify a data point 
by averaging or voting. When averaging is used, the predictions of the component 
classifiers are averaged to make the ensemble classification. With voting, each 
component classifier votes for a category and the ensemble category is the category 
with the most votes. These methods may be modified by weighting the classifiers 
according to their individual accuracy. 

The most popular method for ensemble classification is unweighted averaging 
[4], [6], typically with the outputs of each component classifier being normalized. 
Part of the reason that voting is not as popular is that it does not use all of the 
information available. It does not distinguish between a weak and a strong preference by 
the component classifier. Voting, however, does give the opportunity not to make 
decisions where there is insufficient agreement. This can increase the accuracy of 
classification where a decision has been made. There is thus a trade-off between the 
proportion of data points for which a decision is made and the proportion of those 
data points which are correctly classified. As the threshold of agreement is raised, 
fewer data points are classified, but a higher proportion of those are classified 
correctly. Similarly, as the number of classifiers in the ensemble is increased, fewer 
data points may be classified, but again a higher proportion of those are classified 
correctly. 



1.1 The Effect of Misclassification 

For some classification problems, every data point must be classified. In that case, 
the best classifier is the classifier with the highest proportion of correct 
classifications. For other problems, it is more important to reduce the number of 
misclassifications by not making decisions for some data points in order to increase the 
proportion correct when a decision is made. 

An example is in recognizing postcodes. If an error is made and the article is sent 
to the wrong area, recovery is expensive. It is better to make fewer decisions and 
rely on manual sorting for the remaining articles. When recognizing postcodes, any 
classification error is expensive. In other problems, a classification error in one 
category may be more serious than in another. For example, Smith et al. [7] report 
on a neural network which predicts post-operative bleeding. The aim of their study 
was to enable medical staff to start treatment earlier and to reduce the amount of 
drugs administered. In this case the effect of incorrectly predicting that a patient is 
not at risk of post-operative bleeding is more serious than that of incorrectly predicting 
that a patient is at risk. In the latter case, the patient may be given unnecessary 
treatment. In the former case, an untreated patient may die. 

This study does not differentiate between false positives and false negatives when 
a binary classification is made. 



1.2 The Aim of this Study 

The aim of this study is to investigate the effect of ensemble size and voting 
threshold on the proportion of decisions made and the classification accuracy of those 
decisions. The voting thresholds considered are consensus (all classifiers must 
agree) and majority voting (over half of the classifiers must agree). We use 
feedforward neural networks trained by backpropagation as our component classifiers and 
bagging as our method of constructing ensembles. 
2 Methodology 

Each data set is divided into three parts: a training set, a classifier testing set and an 
ensemble testing set. Each classifier is trained on a set drawn with replacement from 
the training set and of the same size as the training set. The classifier testing set is 
used to test each classifier. The ensemble testing set is used to determine the 
classification accuracy of the ensemble. 
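Drawing a training sample for one component classifier, as described above, amounts to sampling with replacement until the sample is as large as the training set. A minimal sketch (the function name and the seeded generator are illustrative choices, not taken from the study):

```python
import random

def bootstrap_sample(training_set, rng):
    """Draw len(training_set) items with replacement from training_set."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)            # fixed seed only to make the sketch repeatable
training_set = list(range(100))   # stand-in for 100 training examples
sample = bootstrap_sample(training_set, rng)
```

Because items are drawn with replacement, a sample usually contains duplicates and omits some of the original examples, which is what makes the component classifiers differ from one another.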

Majority voting is reported for ensembles with an odd number of classifiers. 
We choose odd ensemble sizes because an even size allows half of 
the votes to go to one category and half to another. The ensemble makes a 
decision if (n+1)/2 component classifiers agree, where n is the ensemble size. For 
consensus voting, all classifiers must agree. The ensemble sizes tested range from 3 to 
either 39 or 49, depending on the data set. The details for each data set are given in Figure 1. 

Figure 1. Sizes of data sets and ensembles 

Data set    Size of         Size of classifier    Size of ensemble    Largest
            training set    testing set           testing set         ensemble
Abalone     2177            1000                  1000                49
Cancer      483             100                   100                 49
Card        350             170                   170                 39
Diabetes    388             190                   190                 39
Heart       460             230                   230                 39
Weedseed    198             100                   100                 39
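The two voting thresholds used in the methodology can be sketched as a single function that either returns the winning category or declines to decide. The function name and the use of `None` for "no decision" are illustrative choices; the threshold n // 2 + 1 equals (n+1)/2 for the odd ensemble sizes used here.

```python
from collections import Counter

def ensemble_vote(votes, rule="majority"):
    """Return the ensemble category, or None if agreement is insufficient.
    `votes` is a non-empty list with one category per component classifier."""
    n = len(votes)
    category, count = Counter(votes).most_common(1)[0]
    if rule == "consensus":
        needed = n                 # all classifiers must agree
    elif rule == "majority":
        needed = n // 2 + 1        # over half; equals (n + 1) / 2 for odd n
    else:
        raise ValueError("rule must be 'majority' or 'consensus'")
    return category if count >= needed else None

decision = ensemble_vote(["benign", "benign", "malignant"])
no_decision = ensemble_vote(["benign", "benign", "malignant"], rule="consensus")
```

With more than two categories, majority voting can also decline to decide, because the most popular category may still fall short of half the votes.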



3 Data Sets 

We have used the Abalone, Cancer, Card, Diabetes, Heart and Weedseed data sets 
in our study. 

In the Abalone data set [8], [9], abalone shellfish are classified into one of three 
age-related groups according to eight measurements made on abalone collected by 
divers in a survey. The set contains 4177 samples. The proportion of data which is 
correctly classified by a single backpropagation classifier is moderate (64%). 

In the Cancer data set [9], [10], nine measurements are used to classify tumors as 
benign or malignant. There are 683 samples. The proportion of data which is 
correctly classified by a single backpropagation classifier is high (94%). 

In the Card data set [9], 51 measurements are used to determine whether 
a customer should be approved for a credit card. The proportion of data which 
is correctly classified by a single backpropagation classifier is high (86%). 

In the Diabetes data set [9], 8 measurements are used to predict whether a Pima 
Indian individual is diabetes positive. The proportion of data which is correctly 
classified by a single backpropagation classifier is high (78%). 

In the Heart data set [9], 35 measurements are used to determine whether one or 
more of four major vessels is reduced in diameter by 50% or more. The proportion 
of data which is correctly classified by a single backpropagation classifier is 
high (80%). 

In the Weedseed data set [11], weed seeds are classified into one of ten types, 
based on seven measurements of dimensions of the seeds. The data consists of 
measurements of 398 different seeds, giving 39 or 40 examples of each seed type. 
The proportion of data which is correctly classified by a single backpropagation 
classifier is only moderate (63%). 



4 Results 

We will first consider the heart data. Figure 2 shows the proportion of times that a 
decision was made using consensus voting for different ensemble sizes, along with 
the proportion of those decisions that were correct. Thus the solid line gives the 
probability of an ensemble of a given size making a decision, and the dashed line 
gives the conditional probability of an ensemble of a given size making a correct 
decision, given that a decision was made. Figure 3 shows the same information for 
majority voting on the heart data. 
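The two quantities plotted for each ensemble size, the proportion of data points for which a decision is made and the proportion of those decisions that are correct, can be computed as in the following sketch. The function name and the convention that `None` marks "no decision" are assumptions made for the illustration.

```python
def decision_stats(decisions, truths):
    """Return (proportion of points decided, proportion correct given a decision).
    `decisions` holds a category or None (no decision) for each data point."""
    decided = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    p_decision = len(decided) / len(decisions)
    if not decided:
        return p_decision, float("nan")   # conditional accuracy is undefined
    p_correct = sum(d == t for d, t in decided) / len(decided)
    return p_decision, p_correct

p_decision, p_correct = decision_stats(["a", None, "b", "a"], ["a", "a", "a", "a"])
```

Note that the second value is conditional on a decision having been made, so it can rise even while the first value falls.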

Figure 2. Results of consensus voting on the heart data. 




As expected, under consensus voting, the probability of making a decision 
decreases slowly as the ensemble size increases. But given that a decision has been 
made, the probability that it is the right decision increases as the ensemble size 
increases. This implies that although decisions will be made less often with large 
ensembles, the decisions made are more likely to be right. 

In order to estimate the asymptotic values of the proportions of decisions made 
and correct decisions made as n increases, non-linear functions of the form 
$y = a + b c^n$ were fitted to the data shown in Figure 2. The variable y is the percentage of 
decisions made or the percentage of correct decisions made; n is the ensemble 
size; and a, b and c are parameters requiring estimation. The statistical package 
SPSS was used to carry out the estimation, and the following functions were 
obtained. 

$$\text{Percentage of decisions made} = 69.82 + 24.16\,(0.92)^n$$

$$\text{Percentage of correct decisions made} = 90.04 - 4.91\,(0.95)^n$$

The first function suggests that as the ensemble size increases, the percentage of 
decisions made follows a power law that starts at about 90% for an ensemble of size 
1 and reaches an asymptote at about 70%. The second function suggests that as the 
ensemble size increases, the percentage of correct decisions made follows a power 
law that starts at about 85% for an ensemble of size 1 and reaches an asymptote at 
about 90%. These functions together could be used to predict both the percentage of 
decisions made and the percentage of correct decisions before embarking on 
construction of a neural network, thus allowing researchers to consider where effort is 
best spent in order to achieve aims. 
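As a quick check of the starting values and asymptotes quoted above, the fitted functions for the heart data can be evaluated directly. The coefficients are those reported in the text; the function and constant names are illustrative.

```python
def fitted(a, b, c, n):
    """Value of the fitted model y = a + b * c**n at ensemble size n."""
    return a + b * c ** n

HEART_DECISIONS = (69.82, 24.16, 0.92)   # percentage of decisions made
HEART_CORRECT = (90.04, -4.91, 0.95)     # percentage of correct decisions made

start_decisions = fitted(*HEART_DECISIONS, 1)     # about 92.0
start_correct = fitted(*HEART_CORRECT, 1)         # about 85.4
limit_decisions = fitted(*HEART_DECISIONS, 1000)  # approaches a = 69.82
limit_correct = fitted(*HEART_CORRECT, 1000)      # approaches a = 90.04
```

Since 0 < c < 1, the term b·c^n vanishes as n grows, so the parameter a is the asymptote in both models.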



Figure 3. Results of majority voting on the heart data. [Plot not reproduced: % correct 
and % decisions against ensemble size.] 



Under majority voting, the pattern is slightly different. The probability of making 
a decision is 1. Then given that a decision was made, the conditional probability that 
the decision is correct decreases slightly as the ensemble size increases. No 
modeling of proportions was therefore carried out for majority voting because the 
relationship between ensemble size and proportion of decisions or correct decisions made 
was much more predictable, with no power law required. 

This pattern of decision making is similar for the abalone data, the card data and 
the diabetes data. The cancer data is similar as well except that under consensus 
voting the probability of making a correct decision is 1 for all ensemble sizes. This 
can be regarded as the ideal situation that the other four data sets aim to achieve but 
do not quite do so. 

The percentages of correct decisions when a decision is made using majority 
voting are given in Figure 4. The single classifier results are included for 
comparison. The results show very little movement as the ensemble size is increased except 
in the case of weedseed data. The percentage of decisions made is not shown as it is 
close to 100% for all ensemble sizes for all data sets except weedseed where it 
reduces from 96% for ensemble size of 3 to 92% for ensemble size of 39. 

Figure 4. Percentage correct vs ensemble size, majority voting 

Ensemble size:    1 (single)    [lost]    [lost]    39
Abalone           64.1          63.8      63.6      63.7
Cancer            94.3          96.0      96.0      95.0
Card              86.4          86.5      87.0      86.5
Diabetes          78.5          78.4      78.4      80.0
Heart             80.4          81.7      80.4      79.1
Weedseed          63.2          67.7      71.7      76.1



The percentages of correct decisions when a decision is made using consensus 
voting are given in Figure 5. The percentage of decisions made is also shown. They 
demonstrate the trade-off expected between increasing the percentage of correct 
decisions at the expense of fewer decisions being made as the ensemble size 
increases. In most data sets the trade-off is reasonable. The cost of improved accuracy 
is a moderate reduction in the proportion of decisions made. Again weedseed is 
atypical. 
Figure 5. Percentage correct and decisions made vs ensemble size, consensus voting 
[Table not reproduced.] 
As indicated in Figures 4 and 5, the weedseed data is a severe test for the 
ensemble, and the results reflect the difficulty in carrying out the classification. Figure 6 
shows the results under consensus voting, and Figure 7 shows the results under 
majority voting. Here we see that under consensus voting, the probability of making a 
decision decreases not just slowly, but sharply, as the ensemble size increases. 
Indeed by the time the ensemble contains about 30 classifiers, the probability of 
making a consensus decision has sunk to about 10%. On the other hand the conditional 
probability of making a correct decision takes similar values to those observed in the 
other data sets. This probability starts at about 70% and increases slowly as the 
ensemble size increases. 

When non-linear functions of the form $y = a + b c^n$ were fitted to the data shown 
in Figure 6, the following estimates were obtained. 

$$\text{Percentage of decisions} = 9.64 + 56.15\,(0.89)^n$$

$$\text{Percentage of correct decisions} = 92.01 - 27.74\,(0.899)^n$$

The first function suggests that as the ensemble size increases, the percentage of 
decisions made follows a power law that starts at about 60% for an ensemble of size 
1 and reaches an asymptote at about 10%. The second function suggests that as the 
ensemble size increases, the percentage of correct decisions made follows a power 
law that starts at about 67% for an ensemble of size 1 and reaches an asymptote at 
about 92%. As before, these models could be used for prediction of success rates in 
a neural network and decision-making about the number of classifiers required to 
achieve stated aims. 
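The authors carried out the estimation in SPSS. As a rough stand-in for readers without it, a model of the form y = a + b·c^n can be fitted by scanning candidate values of c and solving the remaining linear least-squares problem for a and b in closed form. The grid of c values, the function name and the synthetic data below are all assumptions for illustration; the synthetic points are generated from the weedseed coefficients quoted above and are not the study's measurements.

```python
def fit_exponential(ns, ys, c_grid=None):
    """Least-squares fit of y = a + b * c**n by a grid search over c in (0, 1).
    For each candidate c, a and b follow from ordinary linear regression on c**n."""
    if c_grid is None:
        c_grid = [i / 1000.0 for i in range(1, 1000)]
    m = len(ns)
    best = None
    for c in c_grid:
        xs = [c ** n for n in ns]
        sx, sy = sum(xs), sum(ys)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, ys))
        denom = m * sxx - sx * sx
        if abs(denom) < 1e-12:        # regressor is (numerically) constant
            continue
        b = (m * sxy - sx * sy) / denom
        a = (sy - b * sx) / m
        sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

# Synthetic, noise-free data from the reported weedseed model (odd sizes 3..39).
sizes = list(range(3, 40, 2))
values = [9.64 + 56.15 * 0.89 ** n for n in sizes]
a_hat, b_hat, c_hat = fit_exponential(sizes, values)
```

On real, noisy percentages the recovered parameters would of course carry estimation error, and a finer grid or a proper non-linear optimiser may be preferable.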
Figure 6. Results of consensus voting on the weedseed data. 




Under majority voting the effects are similar, but less dramatic. The proportion of 
decisions made falls from 96% for an ensemble of size 3 to 92% for an ensemble 
of size 39. At the same time the percentage of correct decisions made increases from 
63% to 76%. Weedseed thus shows a greater improvement in classification accuracy 
than the other data sets when majority voting is used. This is in agreement with the 
findings of Maclin and Opitz [12], who report that the gains in performance of 
bagging over a single classifier are greater on data sets where there are more than two 
categories. (It should be noted that they use averaging rather than voting, and they also 
report that on these data sets boosting is more effective than bagging.) 

It appears therefore that the weedseed data follows a pattern that is atypical. We 
speculate that the large number of categories in the weedseed data (10 compared to 
2 or 3 in the other data sets) contributes to the difficulty of decision making both by 
consensus and majority voting. Other factors which may contribute to fewer 
decisions being made are that weedseed is a relatively noisy data set and that there are only 
about 20 points in each category in the training set. For weedseed, the standard 
deviation of the individual classifiers was 4%, compared to 1% for each of the other 
data sets. 
Figure 7. Results of majority voting on the weedseed data. [Plot not reproduced: 
% correct and % decisions.] 



5 Conclusion 

The above results and discussion are consistent with the following conclusions. 

1. The proportion of correct decisions made can be significantly improved by using 
consensus voting. 

2. For most data sets the penalty in the reduction in the proportion of decisions 
made is moderate. 

3. In the weedseed data under consensus voting the proportion of decisions made 
drops so sharply that the technique is not useful. This may be due to the large 
number of categories, noisy data and/or relatively few points in the data set. 

4. Majority voting is not as effective as consensus voting in increasing the 
proportion of correct decisions made except in the case of the weedseed data. This may 
also be true in other data sets which share weedseed's characteristics of being 
noisy, small and multi-categoried. 
Abstract. The era of mobile robotics for use in service and field applications is 
gaining momentum. The need for adaptability becomes self-evident in allowing 
robots to evolve better behaviors to meet overall task criteria. We report the use 
of neuro-fuzzy learning for teaching mobile robot behaviors, selecting exemplar 
cases from a potential continuum of behaviors. Proximate active sensing was 
successfully achieved with infrared, in contrast to the usual ultrasonics, and 
viewed the front area of robot movement. The well-known ANFIS architecture 
has been modified, compressing layers to a necessary minimum, with weight 
normalization achieved by using a sigmoidal function. Trapezoidal basis 
functions (B-splines of order 2) with a partition of 1 were used to speed up 
computation. Reference to previous reinforcement learning results was made in 
terms of speed of learning and quality of behavior. Even with the limited input 
information, appropriate learning invariably took place in a reliable manner. 



1 Introduction 

There are many ways in which a learning phase for machine intelligence can now 
be arranged with fuzzy, neural, genetic and reinforcement algorithms [1],[2],[3], 
including combinations like neuro-fuzzy systems [4],[5]. The test bed of mobile 
robots is a suitable environment to evaluate the worth of each paradigm, since direct 
comparisons of performance for exemplar problems can be made. It is also recognized 
that mobile robotics is the fastest growing area of robotic innovation at present, 
leading to robots usable in service and field applications, from hospital gofers [6] to 
robots organizing themselves in a factory situation [7]. A large body of theoretical work is 
now slowly being matched by experimental verification as a range of versatile 
platforms becomes available, such as Nomad, Khepera, Rug Warrior, Pioneer and 
Helpmate, details of which are accessible on the web. 

The theory of machine learning reaches back some 50 years. Practical uses have 
gained prominence in the last 10 years. The concept of defining a suite of behaviors 
has received attention recently [8],[9],[10],[11],[12]. Kelly & Keating [13] have 
investigated the reinforcement learning algorithm for a fleet of mobile robots and 
shared learning over radio links using ultrasonic sensing of proximity. Simulation 
[14],[15] has helped to evaluate a range of interaction scenarios. 

Song & Sheen [16] have obtained neuro-fuzzy obstacle avoidance in the context 
of indoor navigation and concluded that smooth operation was possible. They used a 
ring of 16 ultrasonic sensors providing very rich information upon which to work in a 
cycle time of 250 ms. Tsoukalas et al. [17] devised neuro-fuzzy motion planners for 
mobile robots using an idealised model called MITOS. Their model predicts that 
smooth motion is possible even for rudimentary sensor input. Tschichold-Gurman 
[18] attempted to combine the best of processing strategies such as neural 
networks, fuzzy logic and classical control theory to achieve mobile robot navigation. 
A basic feedforward neural network has a usual fuzzy input attached to it. A set of 
IF-THEN rules is used to shape the net. It is claimed that the learning time is much reduced 
compared with other classifiers. Aycard et al. [19] used a fuzzy controller for reactive 
navigation of a mobile robot in an unknown environment. A two-layer system 
corresponded to local and global behaviors. Experiments with a NOMAD robot 
showed that navigation and avoidance could be achieved simultaneously, but there was no 
learning. Gaussier et al. [20] have taken an interesting approach in employing a 
neural net structure to learn by imitation. The perception-action (PerAc) architecture 
is very similar to a neuro-fuzzy one where the basis functions have been tailored to 
give temporal and spatial discrimination. Vision analysis formed the basis of sensor 
input. Rylatt et al. [21] recently surveyed the field as connectionist research. They 
emphasize the importance of rich environments if learning is to take place and 
conclude, perhaps pessimistically, that connectionist architectures may never provide 
the key to generating cognitive capacities similar to the human brain. On the other 
hand, they feel there is some hope in hand-crafting architectures such as subsumption. 
Kim & Trivedi [22] very recently reported work similar to that presented here. In 
their case, the learning was supervised via the back propagation algorithm and the 
extent of the fuzzy rule set was predetermined. This essentially makes the learning 
significantly dependent on the teacher for a sensible choice of rules. 

N. Foo (Ed.): AI'99, LNAI 1747, pp. 278-290, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 

We present findings on the determination of learning in a continuum of potential
behaviors. The neuro-fuzzy calculations are similar to the ANFIS model [2], but
incorporate least mean squares (LMS) analysis without the need to invoke back
propagation or be selective on rule choice. The work reported here is part of an
investigation into devising parametric ways to evaluate learning paradigms. To that
end, a consistent mobile robot platform was created that had a range of sensing and
movement capabilities backed by computation. The ideal platform should consist of
life-size devices, but the cost when dealing with a fleet of five or more is high. A
similar result can be obtained at a lower level of expenditure while still giving good
relative quantitative results.

Of general interest have been learning rates and how they are affected by the
richness of the environment, the need for normalising to keep function surfaces
within bounds, annealing to avoid local minima, the overall specification of
desired behavior and the platform programming environment. Present results indicate
that learning for behaviors such as avoidance can be achieved in periods of a few
minutes with rudimentary sensors at the 68HC11 level of computing power chosen for
this work. A scaling to the computational speed of a target system is easily achieved.
Infrared is used as the sensing medium, relying on the strength of return signals from
robot on-board emitters as a measure of the closeness of objects around the robot.
Whilst infrared is a noisy medium and can be subject to interference from a wide
range of extraneous sources such as daylight, hot bodies and artificial lighting, it
makes for a realistic platform in testing learning algorithms.
2 Platform



As mentioned, the platform (Rug Warrior Pro) was chosen to keep costs within
reasonable limits. The infrared sensors of the Rug Warrior were modified to sense
to either side of, and directly in, the forward aspect, with good directionality of the
return signals to the mobile robot. The sensors can be clearly seen in Fig. 1. Infrared
filters were also incorporated to reduce the interference from daylight and nightlight
spectra. Two different techniques were used. Firstly, pulsed emission from a
standard infrared emitter/photodiode pair was used to increase noise immunity.
Secondly, a proprietary small infrared range finder (SHARP GP2D02) operating in
the range 70 mm to 800 mm was used instead, to check how the accuracy of distance
measurement affected learning quality. The operating environment enabled programs
to be written subsumptively. Radio communication transceivers were available to
exchange learning data when working in a multi-robot scenario.




Fig. 1. The Rug Warrior Pro platform is available in kit form already incorporating
infrared sensing as well as the capability of using light and sound. The left image
shows the simple sensing and the right image shows mounting of the proprietary
infrared range finder.



3 Theory 

The neuro-fuzzy algorithm was based on modifying the ANFIS model of learning,
incorporating the standard LMS analysis for a system of linear equations containing
errors [2]. A system of linearly separable equations related to some form of MISO
causal system can be represented by

$A(u)\,c = y$ (1)

where $A(u)$ is the matrix of known functions $f$ on $u$, the vector of inputs, $c$ is the
vector of the regression coefficients, and $y$ is the vector of outputs.
Hence $a_{ij}$ equates to $f_i(u_j)$ in a matrix of i columns and j rows. For a MIMO system,
c and y become matrices. In equation (1), for a neuro-fuzzy system, the column
components of A are the membership function values of the full set of inputs whilst
the rows are repetitions of any experiment or exemplar data that is used to evaluate c.
c now becomes synonymous with the weights in a neural net leading to the output
nodes. In practice, the system has noise present and (1) has to be rewritten to give

$A(u)\,c + e = y$ (2)

where e is an error vector.

It is now not possible to find an exact solution to (1), but we can obtain the best
estimate by minimising the squared error

$E(c) = \sum_i (y_i - a_i^T c)^2 = (y - A c)^T (y - A c)$ (3)

It can further be shown that, instead of attempting to obtain the direct inverse of A
by keeping the dimensions of A square, it is better to allow for a variable number of
input/output pairs and solve for the pseudo-inverse, such that

$c = (A^T A)^{-1} A^T y$ (4)

where c is a least mean squares estimator and $A^{-1}$ is replaced by $(A^T A)^{-1} A^T$,
the product $A^T A$ being square.

If the computations of the pseudo-inverse are made as shown above to determine c,
the scalar values within c can be essentially unbounded. Therefore, as the stages for
computing the pseudo-inverse are completed, a sigmoidal function is applied to the
$(A^T A)^{-1}$ term to normalize before the final multiplication. It then remains to apply a
scaling function to adjust weight values to suit the range of motor velocities that were
used to obtain the exemplar data in the first place.
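
As an illustration of equation (4), the following is a minimal sketch of the pseudo-inverse computation in plain Python, using a Gauss-Jordan inverse of the square product A^T A. The function names (`transpose`, `matmul`, `gauss_jordan_inverse`, `lms_weights`) and the toy data are assumptions made for this example, and the sigmoidal normalisation and motor-velocity scaling described above are deliberately left out.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gauss_jordan_inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment m with the identity matrix: [m | I]
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Clear the pivot column in every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lms_weights(A, Y):
    """Least-squares weights c = (A^T A)^-1 A^T Y.

    A holds one exemplar per row; Y holds the matching outputs per row.
    """
    At = transpose(A)
    return matmul(matmul(gauss_jordan_inverse(matmul(At, A)), At), Y)


# Toy data (hypothetical): y = 2 + 3x sampled without noise
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [[2.0], [5.0], [8.0], [11.0]]
c = lms_weights(A, Y)
```

With noise-free data the recovered weights match the generating coefficients; with noisy exemplars the same call returns the least-squares compromise.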

[Figure: five-layer network diagram, Layer 1 to Layer 5, with inputs x and y]

Fig. 2. The Generalised ANFIS Architecture. In general, Layer 1 refers to fuzzification, Layer 2
enacts the AND or min function, Layer 3 normalises, Layer 4 gives options for implementing a
Sugeno, Tsukamoto or like consequent result for an output and Layer 5 is the usual neural
summation term.

In Fig. 2, it is possible to compress the architecture into three layers, retaining
Layer 1 that fuzzifies, omitting Layer 2, normalising in Layer 3, omitting Layer 4 and
retaining a general neural summation in Layer 5. In this way, the fuzzification is
virtually transparent to the analysis as long as the membership functions give a
partition of unity on each input. In a sense, they act as linear separators of the
magnitude of each input into the membership function categories.

The question then arises as to what constitutes input and output for the mobile
robots. In our experiments, state space is quantised (see Fig. 3) so that movement is
divided into eight parts, the stand-still movement being excluded because it
always generates a zero differential. Thus output y corresponds to two angular
velocities, one for each of the two motors, making the dimensions of y 2*n, i.e. a
matrix, where n is the number of results taken per calculation. Input comes from two
sensors, each of which has three membership functions to fuzzify into the nil, far and
near categories. Thus two inputs become six categories in the A matrix, giving it
dimensions 6*n.
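
The fuzzification step can be sketched as follows. This is only an illustrative example: the triangular membership shapes, the breakpoint at 0.5, the assumption that sensor readings are normalised to the range 0 to 1, and the names `fuzzify` and `design_row` are all hypothetical and are not taken from the experiments. The sketch shows three memberships per sensor that form a partition of unity, and how two sensors yield six entries for one exemplar in the A matrix.

```python
def fuzzify(x):
    """Map a normalised sensor value in [0, 1] to (nil, far, near) memberships.

    The three values always sum to 1 (a partition of unity).
    """
    x = min(1.0, max(0.0, x))          # clip to the assumed sensor range
    nil = max(0.0, 1.0 - 2.0 * x)      # strongest when there is no return signal
    near = max(0.0, 2.0 * x - 1.0)     # strongest for a very strong return signal
    far = 1.0 - nil - near             # peaks in between
    return (nil, far, near)


def design_row(left, right):
    """Build the six membership values contributed by one exemplar (two sensors)."""
    return list(fuzzify(left)) + list(fuzzify(right))


row = design_row(0.25, 0.75)
```

Because each sensor's memberships sum to one, the magnitude of the input is simply shared out between neighbouring categories rather than rescaled.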

[Figure: Quantisation of Output. Cells: Fwd left, Forward, Fwd right; Nil, Left; Bkwd left,
Backward, Bkwd right]

Fig. 3. The Robot Quantisation of State Space. The potential behavior of the robot was
quantized for the benefit of allocating membership functions to input values and output motor
movement.
Fig. 4. A set of behaviors is really a continuum depending on sensor input interpretation.

The avoidance behavior is one among a suite of behaviors derived from the
definition of how the strength of return signals from sensors should correlate to the
motion control. If attraction is required, then the detected relationship is used
unmodified. If avoidance is required, then the relationship is inverted. It is also
possible to bias the relationship in such a way that a continuum of behaviors can be
learnt. Fig. 4 illustrates the fundamental modes to be expected.

There still remains the overall interpretation of sensor input. It transpires that it is
possible to detect the proximity of obstacles or objects by noticing that the function
determining their distance away is close to an exponential decay. Theoretically, an
inverse square law may prevail, but in practice it depends on target size as a
function of distance from the robot. A test was made of sensor signal strength as a
function of distance to see how close the desired approximation was. Fig. 5 shows the
relationship obtained for a constant-size target being offered normal to the
propagation direction of the infrared signals.



[Figure: Exponential approximation of sensor readings; horizontal axis: Distance (cm)]

Fig. 5. The sensor functions differentiate to themselves with a scaling factor added. Simple
refers to using an emitter/sensor photodiode pair. GP2 refers to a proprietary infrared range
finder. This is useful in that it is now possible to interrogate the environment by determining
the differential of sensor input and treat it as distance information via scaling.

The approximation of the sensor function to a decaying exponential is remarkably 
close. A look-up table is normally needed to obtain real distance and is based on a 
stereotyped target held normal to the robot. 
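
The property that the sensor function differentiates to itself up to a scaling factor can be written out as a short worked step. The symbols $s_0$ (signal strength at zero distance) and $k$ (decay constant) are introduced here only for illustration and are not quantities reported in the experiments.

```latex
s(d) = s_0\, e^{-k d}
\quad\Longrightarrow\quad
\frac{\mathrm{d}s}{\mathrm{d}d} = -k\, s_0\, e^{-k d} = -k\, s(d),
\qquad\text{so}\qquad
\Delta s \approx -k\, s(d)\, \Delta d .
```

Under this assumed model, the change in sensor reading produced by a small movement is proportional to the reading itself, which is why a difference of readings can stand in for scaled distance information.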



4 Experiments 

Avoidance is a part of a spectrum of generic behaviors depending upon the 
interpretation of what sensor inputs mean. Attraction is the converse and wall 
following is a band pass version of them both. Hence avoidance has been used here to 
demonstrate learning rates for comparison purposes. The environment also plays a 
key role in affecting learning rates. Exposure to both rich and sparse environments 
has shown an obvious effect on learning convergence rates such that the richer the 
environment, the faster the learning. 

The mobile robots were programmed so that calculation of the LMS estimator, c,
containing the desired weights for driving the motors, was repeated over the same
repertoire of movements covering all possible ways the robot could move. This
amounted to using a block of 24 separate movements, repeating each of the eight
possible moves thrice. On this basis, it was possible to watch the robot move around
sensing proximate objects and then for the experimenter to rearrange the objects from
time to time as the actions were repeated. An action consisted of moving motors set
distances in combination to achieve small changes in both direction and position.
Sensor input was measured before and after movement, the difference, as explained in
Section 3, being used as the scaled distance and location of the object from the robot.

Bearing in mind the potential noisiness of the system sensors, a rolling average 
routine was then applied to the blocks of 24 results as they were obtained. Hence it 
was possible to see how stable the weight values were between blocks of data and to 
note any longer term changes. Sets of upward of 30 pseudo-inverse calculations on 
each block of 24 data were completed, the rolling average applied and the changes to 
weights observed. Robots were then able to change to a non-learning mode and their 
behavior to be fixed. 



5 Reinforcement Results 

Some previous results [23] on learning with reinforcement using a modified Q- 
learning algorithm, have been obtained on the platform in use here. Here, the standard 
pulsed emitter/sensor technique was used. 
[Figure: inputs detect 'near left/nil' case]

Fig. 6. Reinforcement learning times on the target mobile robot showing retention of some
multi-mode behavior.

For the purposes of comparison, Fig. 6 shows the resolution to a near-dominant
choice of action when detecting an obstacle in the near left/nil position, nil meaning
that on the right sensor, essentially very little incremental change in the return signal
was detected. The learning has been terminated at around 900 iterations and shows
clearly the iteration points when each reinforcement took place. Actions 'right', 'back
right', 'left' and 'back left' are all still retained as significant, with a tendency for 'right'
to become the dominant result given enough time. The roulette wheel method of
choosing the next action clearly shows there is scope for non-optimum actions to be
tested long after a more successful action has developed a high probability of choice.
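
The roulette wheel method of choosing the next action can be sketched as below. This is a generic illustration of roulette-wheel (fitness-proportionate) selection, not the modified Q-learning code of [23]; the function name `roulette_pick`, the action weights and the convention of passing in a uniform random number `u` are assumptions made so that the example is deterministic.

```python
import random


def roulette_pick(weights, u):
    """Pick an index with probability proportional to its weight.

    weights: non-negative action weights (not all zero).
    u: a uniform random number in [0, 1), supplied by the caller.
    """
    threshold = u * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


# Hypothetical weights: one action is favoured but the others stay selectable
actions = ["right", "back right", "left", "back left"]
weights = [6, 2, 1, 1]
choice = actions[roulette_pick(weights, random.random())]
```

Since every action with a non-zero weight keeps a non-zero slice of the wheel, a weaker action can still be drawn occasionally, which matches the behaviour described above.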




Fig. 7. Net vectoral movement of the mobile robot after reinforcement learning is complete. 
The legends near right etc. refer to the categorization of sensor input. 

Fig. 7 shows all the reinforcement learning results, except for the input duple
(nil/nil), expressed as net vectoral movement based on the learning that had taken
place after 900 iterations, equivalent to approximately 7 minutes of running time at an
update rate of about 2 Hz. Much of this time is devoted to actually moving the robot in
order to get a difference input. Hence the speed of learning is not computationally bound.
The results indicate the robot would move in a sensible direction to avoid obstacles
detected all around the forward-looking aspect. It should also be noted that the
resultant avoidance behavior automatically defaults to forward movement if no
obstacle is in the way.



6 Neurofuzzy Results 

As outlined in Section 4, the experimental procedure in a neuro-fuzzy system used
the LMS method for obtaining learned weights. Energizing motor movement to obtain
a particular behavior relies on the accumulation of at least one block of 24 exemplar
data before a result is possible. This is in contradistinction to the incremental method
used in Q-learning reinforcement where improvement is seen to be gradual and
evolves at every interrogation of the environment.

Fig. 8 shows a series of independent results. Calculations using the pseudo-inverse
formula, equation (4), revealed that some results were sensitive to rounding errors in
the evaluation of the inverse $(A^T A)^{-1}$ obtained by a standard Gauss-Jordan
elimination method. It can be seen that the prediction of weights is variable across all
results, sometimes with good reproducibility. Fig. 9 shows the effect of incorporating
past results to improve the estimate of the weights with a running average.

The equation for the running average is

$c_k = \gamma c_k + (1 - \gamma) c_{k-1}$ (5)

where $c_k$ is the k-th result, $c_1$ being determined with $c_0$ identically 0 and $\gamma$ set to 1;
$\gamma$ is normally in the range of 0.1 to 0.25 and k is the iteration number.
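
A minimal sketch of this running average follows, reading the update as an exponentially weighted blend of the newest raw weight vector with the previous averaged one. The function name `running_average` and the sample weight vectors are hypothetical; the first result is taken as-is, which corresponds to starting from zero with the blending factor set to 1.

```python
def running_average(results, gamma=0.1):
    """Smooth a sequence of weight vectors with an exponentially weighted average.

    results: list of weight vectors (lists of floats), one per block of data.
    gamma: blending factor applied to the newest result.
    Returns the list of averaged weight vectors, one per block.
    """
    averaged = []
    previous = None
    for current in results:
        if previous is None:
            # First block: equivalent to gamma = 1 with a zero starting value
            smoothed = list(current)
        else:
            smoothed = [gamma * new + (1.0 - gamma) * old
                        for new, old in zip(current, previous)]
        averaged.append(smoothed)
        previous = smoothed
    return averaged


# Hypothetical left/right motor weights from three successive blocks
history = running_average([[1.0, -1.0], [2.0, -2.0], [1.0, -1.0]], gamma=0.5)
```

A small blending factor makes the averaged weights respond slowly to any single noisy block, at the cost of needing more blocks before they settle.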




Fig. 8. Neuro-fuzzy learning of left motor and right motor weights with the proprietary infrared
range finder. There was a small improvement in the quality of results when using the range
finders instead of the standard sensors. No running average has been applied, therefore each
result is independent of all others.

Fig. 9. The same results as in Fig. 8 but with a running average applied obeying the
relationship as in equation (5) where $\gamma$ is 0.1.

In order to see the significance of the results in Fig. 8 and Fig. 9, a visualization of
behavior is shown in Fig. 10. Each horizontal bar has two arrows attached to either
end that represent motor speeds. The bar/arrow combinations have been placed at
various places where an obstacle is detected. Any combination of obstacles can lead
to a superposed result for the motor speeds. The general fields of view of each sensor
are qualitatively marked in along with the ranges where the fuzzy membership
functions near/far/nil cross over. An omission of an arrowhead means that the speed
was virtually zero.

The behavior of the mobile robot using the learned results led to very little
movement if the obstacle was far away in the nil/nil range. As the obstacle comes
nearer, the mobile robot will back away from it in a balanced manner. This is in
contradistinction to reinforcement learning where learning gives forward movement if
no obstacles are detected. The time to complete the movements and calculations of
weights from one block of data to the next was approximately 20 seconds. Given the
inclusion of the benefit of the running average over ten iterations, the time to learn
was estimated to be just over 3 minutes. It should be noted that environment
interrogation via mechanical movement dominates the learning time once again, since
the final calculation of the weights involving a Gauss-Jordan inverse takes less than
4% of the single iteration time.




Fig. 10. A representation of the learned neurofuzzy behavior by reference to the magnitude of 
motor speed at various places where the obstacle is detected. 

Fig. 11 displays a set of three neurofuzzy learning outcomes where a forward bias
has been introduced to either motor control for the left and right results. The nature of
the results shifted such that the behaviors avoidance and seek right/left were
superposed. For the left forward motor bias, the change from zero bias appears more
modest than for the right. Both biased results show less tendency to avoid than in the
zero bias condition (center result) for objects that are near front. Observation of
behavior after learning showed a mobile robot seeking left/right or virtually still in the
absence of obstacles and then proceeding right/left in avoidance preference. Bearing
in mind the highly non-linear characteristics of the sensor detection lobes and their
idiosyncratic shapes, the results are best evaluated by observing behavior. It must be
noted that the result is particular to one mobile robot in one chosen environment of a
given richness. Having said that, the learning of appropriate behavior for such a
unique scenario is now practically possible and reduces much of the uncertainty of
calibrating in the absence of knowing the target environment.




Fig. 11. An illustration of superposed behaviors. The left and right results are for seek right and 
seek left respectively superposed on top of avoidance alone which applies to the center result. 



7 Conclusions 

The mobile robot learning behavior in the real world has been characterized in 
such a way as to have confidence that learned behavior is sensible and robust even 
with noisy sensors. The avoidance behavior turns out to be a good vehicle for 
evaluating a number of issues concerning real world environmental uncertainty. 

The neurofuzzy paradigm has proved useful in generating a spectrum of possible 
behaviors from adjustment of only a few fundamental parameters in the algorithm. 
The use of neurofuzzy learning has generated the possibility of tailoring behavior of a 
mobile robot to suit the target environment superposing behavior traits into the one 
learning phase. 

Whilst there are a number of ways the platform can be improved, such as using
more powerful emitters, modulation to reduce noise and highly directional
emission, the relatively slow computational speed of the platform has not been on the
critical path. Rather, it is the mechanical movement in the real world that ultimately
limits speed of learning. Hence the cost benefit of the small mobile robot platform is
very high.



8 Future Work 

At present, further paradigms are being loaded into the platform to build a
parametric database on learning technique. The results will be blended into the
multi-mobile robot scenarios to enable the robots to reliably identify and move with
respect to each other, communicating both by radio link and by infrared emission
signal coding.
Abstract. The significance of representing duration information along
with the qualitative information of the time intervals is well argued in
the literature. A new framework, the INDU (INterval and DUration) network,
consisting of 25 basic relations, is proposed here. INDU can handle
qualitative information of time interval and duration in one single structure.
It inherits many interesting properties of Allen's Interval Algebra
(of 13 basic relations) but it also exhibits several interesting additional
features. We present several representations of INDU (ORD-clause, Geometric
and Lattice) and characterise its tractable subclasses such as the
Convex and Pre-convex classes. The important contribution of the current
study is to show that for the tractable subclasses (Convex as well as
Pre-convex) 4-consistency is necessary to guarantee global consistency of
an INDU-network.



1 Introduction 

The field of temporal reasoning has been a central research topic in AI for
many years. It has been studied in a wide range of areas such as knowledge
representation, natural language understanding, commonsense reasoning, qualitative
reasoning, plan generation, scheduling etc. Temporal information is typically
represented in terms of a collection of qualitative and metric relations
constraining time points or intervals. The main reasoning tasks are concerned
with determining the consistency of such a collection, deducing other implied
relations from the given set of relations, and finding the instantiation of the point
and interval variables satisfying the set of constraints imposed by the set of
relations. Allen [1] proposed an algebraic framework named Interval Algebra (IA)
for qualitative reasoning with time intervals, where the binary relationship between
a pair of intervals is represented by a subset of 13 atomic relations. Vilain
and Kautz [13] proposed a sub-algebra, Point Algebra (PA), on the set of time
points with three atomic relations between a pair of points: <, = and >. The
origin of the 13 atomic relations can be traced back to the philosophical analysis
presented by C. D. Broad in 1938 [3]. Broad holds the view that the temporal
characteristics of experience fall into three different sets: (i) temporal relations,
(ii) duration and (iii) transitive aspects of temporal facts. Broad also mentions
that two of the above aspects, namely duration and temporal relations, are very
closely interconnected and they can be grouped under the heading of extensive
N. Foo (Ed.): AI’99, LNAI 1747, pp. 291-303, 1999. 

© Springer-Verlag Berlin Heidelberg 1999
aspect of temporal facts. Since 1983, when IA was proposed, a substantial
amount of work has been carried out on the topic by many researchers, yet
the importance of qualitative information on duration has been ignored except
for a few isolated cases. Though IA and PA are popularly used for representing
temporal information, they cannot specify qualitative information of interval
duration. Allen observes that in order to encode duration information a separate
system, orthogonal to the IA system, is necessary. In the same spirit, some recent
work proposed bi-network based approaches to represent information about
duration [2], [8] and [14]. These recent works and the philosophical analysis by C. D.
Broad inspired us to revisit the IA. The significance of representing information
about durations is well argued in the literature.

Example 1. Let us consider the following narration. John and Mary are doctors
and work in the same hospital. Today, they arrived at their work at the same time
and they immediately started examining their respective patients without rest.
They examined a patient together later in the day. By then, John had finished
with 3 patients whereas Mary had examined only 2 patients. One can easily infer
that John and Mary were examining the patient either 'during office
hours', or 'they started examining during office hours but stayed late to finish the
work', or 'they started the work only after the day was over'.

A similar story can be told in another context. John and Mary are students
in the same Department. Today they arrived at the department at the same time
and immediately went to their respective classes. Sometime later in the day they
attended the same class. But by then John had attended 3 classes without any
gap and Mary had attended two lectures without rest. One can see that the
IA-network for both these stories is the same. But there is some implicit information
in the second one, namely that all the classes are of the same length. If
we incorporate this information about duration, the given information becomes
inconsistent.



Example 2. Let us consider the scheduling of concurrent processes where each 
process is comprised of certain atomic transactions. In a database environment, 
each atomic transaction can be visualised as a seek operation. Depending on the 
hardware and the file size the actual seek time can be calculated. This informa- 
tion may not be available at the time of scheduling. On the other hand, based 
on the number of atomic transactions in a process, we can always determine the 
relative duration of each of the processes. 

In the foregoing discussion, we make two major observations: 

- There are instances when the duration information is implicit, but should
not be ignored. If it is ignored, it may lead to a wrong conclusion.

- There are instances where the relative duration, instead of the absolute length
of the interval, is more relevant.

The existing frameworks of Temporal CSP do not model this aspect. The
expressive power of IA and PA is limited, as it is not possible to represent
qualitative information of interval duration along with other relations. The
bi-network based approaches put an unnecessary burden on computational resources.
Keeping these in mind, we attempt to answer the question of whether an IA-like
framework can be designed to represent qualitative information of intervals
and duration. Of course, we expect such a system to preserve the simplicity
and elegance of the IA. We also hope that the specialised (computationally
efficient) algorithms developed over the years for IA based reasoning systems (in
various domains) could be easily adapted for the proposed extended framework
of temporal reasoning.

In this paper, we gracefully extend the IA to model qualitative information
about intervals and durations in a single binary constraint network, and we term
this network INDU, the INterval and DUration network. INDU comprises 25
basic relations between a pair of intervals. We discuss various representations
(ORD-clause, Geometric and Lattice) of INDU and characterise its subclasses,
namely, convex and pre-convex classes. Our main result is to show that this
network exhibits similar characteristics to the IA network in the sense that for
the pre-convex classes consistency checking can be carried out by a polynomial
algorithm. But the level of local consistency necessary for the global consistency
of an INDU-network is 4, contrary to the 3-consistency in the case of an IA network.
We prove that if the network is strongly 4-consistent and the relations are in the
Pre-convex class, then the network is globally consistent. Hence, a feasible schedule
of the network can be found by a polynomial algorithm for these classes.

2 Extension of Interval Algebra: INDU

We begin with the 13 basic relations of IA [1], $I = \{b, bi, m, mi, o, oi, s, si, d, di, f, fi, eq\}$.
Intuitively, 7 of these 13 relations between any two intervals implicitly
represent relations between their corresponding durations. These are
$\{eq, s, si, d, di, f, fi\}$. However, in the case of the 6 remaining relations, i.e.,
$\{b, bi, m, mi, o, oi\}$, nothing is known about the durations of the respective
intervals. There are only three possible ways the durations of these intervals can be
related: $\{<, =, >\}$. Therefore, we need 18 (6 x 3) relations to express relationships
between intervals and durations with respect to the above six relations. Hence, we
need a total of 25 basic relations to adequately model qualitative information about
intervals and durations. These relations have an important property: they integrate
the qualitative information of interval duration along with its relative position
on the time scale. The relations are denoted as

$E = \{b^{<}, b^{=}, b^{>}, bi^{<}, bi^{=}, bi^{>}, m^{<}, m^{=}, m^{>}, mi^{<}, mi^{=}, mi^{>}, o^{<}, o^{=}, o^{>}, oi^{<}, oi^{=}, oi^{>}, s^{<}, si^{>}, d^{<}, di^{>}, f^{<}, fi^{>}, eq^{=}\}$.
For any two intervals, exactly one of the set of 25 atomic relations holds.
Let $A^s$, $A^e$ and $A^d$ denote the start point, the end point and the duration,
respectively, of any interval A. Then, these relations can also be expressed
as sets of inequalities in terms of the end points and durations. For example,
$X\,b^{<}\,Y \equiv X^e < Y^s \land X^d < Y^d$ (see Table 1). Indefinite qualitative temporal
information can be expressed as a disjunction of the atomic relations and can be
represented as a subset of E. For example, the expression $X\{b^{<}, o^{<}\}Y$ means
that $X\,b^{<}\,Y \lor X\,o^{<}\,Y$. Thus, there can be $2^{25} = 33554432$ possible binary relations
between any pair of time intervals. This set of relations is denoted by INDU, a
rich language for representing temporal knowledge.
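
The construction of the 25 basic relations can be checked with a small sketch. The function name `indu_relation`, the string encoding of relations (an Allen relation name followed by `<`, `=` or `>` for the duration comparison) and the use of small integer end points are assumptions made for this illustration only.

```python
from itertools import product


def indu_relation(x, y):
    """Return the basic relation between intervals x and y as a string.

    x and y are (start, end) pairs with start < end. The result is the Allen
    relation of x with respect to y, followed by the comparison of durations.
    """
    (xs, xe), (ys, ye) = x, y
    if xe < ys:
        allen = "b"
    elif ye < xs:
        allen = "bi"
    elif xe == ys:
        allen = "m"
    elif ye == xs:
        allen = "mi"
    elif xs == ys and xe == ye:
        allen = "eq"
    elif xs == ys:
        allen = "s" if xe < ye else "si"
    elif xe == ye:
        allen = "f" if xs > ys else "fi"
    elif ys < xs and xe < ye:
        allen = "d"
    elif xs < ys and ye < xe:
        allen = "di"
    else:
        allen = "o" if xs < ys else "oi"
    xd, yd = xe - xs, ye - ys
    sign = "<" if xd < yd else ("=" if xd == yd else ">")
    return allen + sign


# Enumerate every pair of intervals with integer end points in 0..5
intervals = [(s, e) for s in range(6) for e in range(s + 1, 6)]
observed = {indu_relation(x, y) for x, y in product(intervals, repeat=2)}
```

Collecting the relations over all pairs gives exactly 25 distinct values, since the seven relations that already fix the duration comparison never appear with any other sign.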
Table 1: 25 basic relations and the schematic diagram



3 Different Representations of INDU

There have been many different methods of visualising the qualitative relationships
of intervals as proposed in IA. These include (i) expressing the relations
in INDU in ORD-clause form [9]; (ii) geometrically representing the intervals
in two-dimensional planes and considering the relations as admissible regions
in this plane [6], [10]; and (iii) representing the relations as a lattice [6]. These
interpretations provide a rich insight and better understanding of the expressive
power of the language.



3.1 ORD-Clause Form 

Nebel et al. [9] introduce ORD clauses and show that all interval formulas of
IA can be translated into their equivalent ORD-clause form. Along similar lines,
the relations in INDU can be equivalently expressed as ORD-clauses where the
literals involve not only the end point parameters but also the duration variables
$X^d$ and $Y^d$. We give below the relevant concept as introduced in [9] but
suitably modified for the present context.

Similar to the traditional definition, by a clause we mean a disjunction of
literals. However, by a literal in this case we mean an atomic formula of one of
the following forms: $a = b$, $a \le b$, $a \ne b$, $a \not\le b$, where a and b are $X^s$, $X^e$ or $X^d$
for some time interval X. The literals $a = b$ and $a \le b$ are termed positive
literals and the other two literals, $a \ne b$ and $a \not\le b$, are negative literals. It may be
noted that the relation $a \not\le b$ can be equivalently written as two clauses, $a \ne b$
and $b \le a$. A formula is essentially a conjunction of a set of ORD clauses.

Definition 1 (ORD-Clause Form). The ORD-clause is a clause containing
literals of the following forms: $a = b$, $a \le b$, $a \ne b$, where a and b are $X^s$, $X^e$
or $X^d$ for some time interval X. The ORD-clause form of an interval relation is the
clause form containing only ORD-clauses. ORD-Horn clause formulas are the
ORD-clause forms admitting at most one positive literal.

The following result can be inferred directly from definitions. 

Theorem 1. Any relation in INDU can be equivalently represented in ORD-clause
form.



Example 3. Though every relation can be equivalently expressed as ORD-Ciause 
form, it is not possible to express every relation in ORD-Horn form. For instance, 
the relation X{bi^,oi^)Y cannot be represented as ORD-Horn form. Its equiv- 
alent expression is {X<^ < V F® < X'>) A (F<^ < V < F®) A {X^ 
Y‘^) A (X*" ^ F®). On the other hand, the relation X{bi^,bi^ ,oi^,oi~,oi^)Y 
can be represented in ORD-Horn clause form as follows:

(Y^e ≤ X^e) ∧ (Y^s ≤ X^s) ∧ (Y^e ≠ X^e) ∧ (Y^s ≠ X^s) ∧ (Y^e ≠ X^s) ∧
(X^s ≤ Y^e ∨ X^d ≠ Y^d).
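
As a rough illustration of the ORD-Horn condition (at most one positive literal per clause), the following Python sketch encodes a formula as a list of clauses, each clause being a list of (term, operator, term) triples. The encoding, the operator strings and the names Xs, Xe, Xd, Ys, Ye, Yd (start, end and duration of X and Y) are assumptions made for this illustration only. The first formula follows the ORD-Horn representation above, read with ≤ and ≠; the second is a made-up clause with two positive literals.

```python
# Operators "=" and "<=" form positive literals; "!=" forms a negative literal.
POSITIVE_OPS = {"=", "<="}


def is_ord_horn(formula):
    """Return True if every clause has at most one positive literal."""
    return all(
        sum(1 for (_, op, _) in clause if op in POSITIVE_OPS) <= 1
        for clause in formula
    )


# Five unit clauses and one two-literal clause (one positive, one negative).
horn_formula = [
    [("Ye", "<=", "Xe")],
    [("Ys", "<=", "Xs")],
    [("Ye", "!=", "Xe")],
    [("Ys", "!=", "Xs")],
    [("Ye", "!=", "Xs")],
    [("Xs", "<=", "Ye"), ("Xd", "!=", "Yd")],
]

# Hypothetical clause with two positive literals, hence not Horn.
non_horn_formula = [[("Xd", "<=", "Yd"), ("Ye", "<=", "Xs")]]

print(is_ord_horn(horn_formula), is_ord_horn(non_horn_formula))
```

The check is purely syntactic: it says nothing about whether some other, equivalent formula for the same relation might be Horn.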



3.2 Geometric Representation of INDU

It is useful to visualise the relations of INDU as regions in the Euclidean plane.
An interval, which is an ordered pair of real numbers (X^s, X^e), can be represented
as a point in ℝ². Since X^s < X^e holds for an interval (X^s, X^e), the space of all
time intervals can be identified with the half plane H defined as the upper half
of ℝ² to the left of the line X^s = X^e. Let A = (A^s, A^e) be a fixed interval with
duration A^d. Then for each atomic relation r ∈ E, there is a well-defined
region in the half plane H which consists of the points X = (X^s, X^e) related by r
to A. The regions associated with all atomic relations are shown in Figure 1.
More generally, the region associated with a non-atomic relation is the
union of the regions associated with each of the constituent atomic relations.
The dimension of a relation is defined as the dimension of the associated region.
The relations b^<, b^>, o^<, o^>, oi^<, oi^>, bi^<, bi^>, d^< and di^> are of
dimension 2. The relations m^<, m^>, s^<, fi^>, f^<, si^>, o^=, oi^=, mi^<, mi^>,
bi^= and b^= are of dimension 1. The relations eq^=, m^= and mi^= are
0-dimensional regions. We shall be making use of this representation throughout
the work.




296 Arun K. Pujari et al. 




Fig. 1. The admissible regions of Y for a given X when Y r X.




Fig. 2. Lattice Representation 



3.3 Lattice Representation of INDU

The relations in IA are represented in the form of a lattice to provide a different
insight [7], which helps in defining convex, pre-convex and other relations. These
relations possess some interesting computational properties. Along similar lines,
we represent the atomic relations in INDU as a lattice. Any atomic relation r
between interval variables X and Y is given by a triple of numbers (m1, m2, m3)
as follows. If Y^s < X^s, then m1 = 0; if Y^s = X^s, then m1 = 1; if X^s < Y^s < X^e,
then m1 = 2; if Y^s = X^e, then m1 = 3; and if Y^s > X^e, then m1 = 4.

Similarly, m2 is 0, 1, 2, 3 or 4 depending on whether the relation corresponds
to Y^e < X^s, Y^e = X^s, X^s < Y^e < X^e, Y^e = X^e or Y^e > X^e, respectively,
and m3 is 0, 1 or 2 depending on whether Y^d < X^d, Y^d = X^d or
Y^d > X^d, respectively. Thus each of the 25 relations in E, coded as a triple,
is a point in the lattice, and the lexicographic ordering of these triples induces
a partial order in the lattice (see Figure 2). Precedence in the lattice is defined
component-wise: (m1, m2, m3) ≤ (n1, n2, n3) if and only if m1 ≤ n1 ∧ m2 ≤ n2 ∧
m3 ≤ n3. A non-atomic relation can be represented as a subset of the
lattice.
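
The coding scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: intervals are given as (start, end) pairs of integers, and the function names `position`, `lattice_code` and `precedes` are hypothetical, not taken from the paper.

```python
def position(p, lo, hi):
    """Code 0..4 for the position of point p relative to the interval (lo, hi)."""
    if p < lo:
        return 0
    if p == lo:
        return 1
    if p < hi:
        return 2
    if p == hi:
        return 3
    return 4


def lattice_code(x, y):
    """Triple (m1, m2, m3) describing interval y relative to interval x."""
    xs, xe = x
    ys, ye = y
    xd, yd = xe - xs, ye - ys
    m1 = position(ys, xs, xe)          # start of Y against X
    m2 = position(ye, xs, xe)          # end of Y against X
    m3 = 0 if yd < xd else (1 if yd == xd else 2)  # duration of Y against X
    return (m1, m2, m3)


def precedes(m, n):
    """Component-wise precedence between two lattice codes."""
    return all(a <= b for a, b in zip(m, n))


print(lattice_code((5, 8), (1, 3)))   # Y entirely before X and shorter
print(lattice_code((5, 8), (3, 6)))   # Y overlaps X with equal duration
```

Integer endpoints are used so that the equality tests are exact; with floating-point endpoints a tolerance would be needed.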



4 Subclasses of INDU

We use the lattice representation of INDU to define convex, pre-convex and
other relations. The motivation behind identifying these relations is that the
similar relations in IA provide insight into the understanding of tractable
classes [7]. For any two atomic relations r and s such that r ≤ s in the lattice,
[r, s] denotes the set of all relations which lie between r and s in the lattice.
For instance, [b^<, o^=] represents the set of relations between b^< and o^= in the
lattice. b^< has the lattice code (0,0,0) and o^= corresponds to the code (0,2,1).
Hence the elements which lie between these two elements in the lattice are
(000), (001), (010), (011), (020) and (021). The corresponding relation in INDU is
{b^<, b^=, m^<, m^=, o^<, o^=}. We call this an interval in the lattice.
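
The enumeration of a lattice interval can be reproduced with a short Python sketch. The table of endpoint codes below is derived here from the coding scheme of Section 3.3 and the usual Allen relations; it is an assumption of this illustration rather than a table given in the paper, as are the names `ALLEN`, `ATOMS` and `lattice_interval`.

```python
from itertools import product

# Endpoint codes (m1, m2) of the 13 Allen relations, each with the duration
# codes (0: shorter, 1: equal, 2: longer) compatible with that relation.
ALLEN = {
    "b": ((0, 0), (0, 1, 2)), "m": ((0, 1), (0, 1, 2)), "o": ((0, 2), (0, 1, 2)),
    "fi": ((0, 3), (2,)), "di": ((0, 4), (2,)), "s": ((1, 2), (0,)),
    "eq": ((1, 3), (1,)), "si": ((1, 4), (2,)), "d": ((2, 2), (0,)),
    "f": ((2, 3), (0,)), "oi": ((2, 4), (0, 1, 2)), "mi": ((3, 4), (0, 1, 2)),
    "bi": ((4, 4), (0, 1, 2)),
}
DURATION_SYMBOL = "<=>"

# Map each valid lattice code to the name of its atomic relation.
ATOMS = {
    (m1, m2, m3): f"{name}^{DURATION_SYMBOL[m3]}"
    for name, ((m1, m2), durations) in ALLEN.items()
    for m3 in durations
}


def lattice_interval(r, s):
    """Atomic relations whose codes lie component-wise between codes r and s."""
    box = product(*(range(lo, hi + 1) for lo, hi in zip(r, s)))
    return [ATOMS[code] for code in box if code in ATOMS]


print(len(ATOMS))
print(lattice_interval((0, 0, 0), (0, 2, 1)))
```

Filtering by `ATOMS` matters because not every triple in the bounding box is a valid atomic relation; for example, a Y that starts X cannot be longer than X.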

Definition 2 (Convex Relation). A relation in INDU is said to be a convex
relation if it corresponds to an interval in the lattice and is of the form [r, s] for
some r, s ∈ E.

In the geometric interpretation, a convex relation is a convex region in H. The
convex relations, when translated into ORD-clause form, yield formulas containing
only unit clauses. Define C ⊂ INDU as the set of all convex relations. Enumeration
reveals that there are 227 convex relations in INDU, that is, |C| = 227.

Theorem 2. For an interval relation c ∈ INDU between any two intervals X
and Y, the following are equivalent.

1. c is a convex relation.

2. The restriction imposed by X on the domain of Y is a convex region in H.

3. The equivalent ORD-clause form of c in terms of the endpoints and durations
of X and Y contains only unit clauses.



Definition 3 (Pre-convex Relation). Pre-convex relations are those relations
which are obtained from a convex relation by removing zero or more
atomic relations of lower dimension.

In other words, if the convex relation corresponds to a region of dimension
2, then the pre-convex relations are obtained by removing atomic relations of
dimensions 1 and 0, but if the convex relation is of dimension 1, then we obtain
the pre-convex relations by removing only atomic relations of dimension 0.
Obviously, any convex relation is a pre-convex relation. Geometrically, the region
of a pre-convex relation is a convex shape with linear or point discontinuities.
Define PC ⊂ INDU as the set of all pre-convex relations. Enumeration reveals
that there are 77223 pre-convex relations in INDU.

Definition 4 (Convex Closure). For an INDU-relation r, its convex closure is
the smallest (in terms of set inclusion) convex relation that contains r.

From any pre-convex relation, we can obtain its convex closure by restoring the
lower-dimensional elements that were deleted to obtain the relation.

Theorem 3. For an interval relation c ∈ INDU between any two intervals X
and Y, the following are equivalent.

1. c is a pre-convex relation.

2. The restriction imposed by X on the domain of Y is a convex region in H
with linear or point discontinuities.

3. The equivalent ORD-clause form of c in terms of the endpoints and durations
of X and Y contains only ORD-Horn clauses.

Example 4. The relation X{bi^<, bi^>, oi^<, oi^=, oi^>}Y is a pre-convex relation. Its
clause representation is (Y^e ≤ X^e) ∧ (Y^s ≤ X^s) ∧ (Y^e ≠ X^e) ∧ (Y^s ≠ X^s) ∧
(Y^e ≠ X^s) ∧ (X^s ≤ Y^e ∨ X^d ≠ Y^d). It is obtained from the lattice interval
by removing the elements and , out of which mi~ 

is 1-dimensional and others are 2-dimensional. In ℝ², the admissible domain of Y
for a fixed instantiation of X is the region shown in Figure 3. On the other hand,
{h^,b~,m^,o^,s'^,eq~} is not a pre-convex relation. 

5 Reasoning in INDU

A constraint network is a simple way of representing a set of variables with
specified domains and the constraints between them. Hence, any temporal
information with interval variables and binary relations that are elements of INDU
can be represented as a constraint network to facilitate reasoning with intervals
and durations. The main reasoning problem for a constraint network is to
determine whether the network is consistent and, if so, to determine the
feasible relations and the minimal network. We study this aspect of INDU in
this section.

Definition 5 (INDU-Network). An INDU-network is a binary constraint network
consisting of a set of n interval variables I_1, I_2, ..., I_n and a set of binary
constraints between the variables. These binary constraints are interval relations
in INDU. A convex INDU-network is an INDU-network having all its relations
in C. Similarly, a pre-convex INDU-network has all its relations in PC.

An instantiation of an interval X is an assignment of X to a point in H.
A consistent instantiation of X and Y is an instantiation of X and Y such
that the binary relation between them is satisfied. In a network, a consistent
instantiation of all the interval variables is called a solution. Local consistency
has proven to be an important concept in the area of constraint networks. A local
consistency algorithm serves as a pre-processing step that makes the subsequent
backtracking search more efficient. Moreover, for certain subclasses of the
problem, local consistency suffices to determine global consistency. We analyze
below such properties for the convex and pre-convex classes of INDU-networks.

Definition 6 (k-consistency [4], [12]). A network is said to be k-consistent
if and only if for any instantiation of any k − 1 variables satisfying all the direct
relations among those variables, there exists an instantiation of the kth variable
such that the k values taken together satisfy all the relations among them. A
network is strongly k-consistent if and only if it is j-consistent for all j ≤ k.

We call a network globally consistent when it is strongly n-consistent.
Any globally consistent network is necessarily minimal, but a minimal network
need not be globally consistent.

Theorem 4. Consider a convex INDU-network with n interval variables
such that the intervals I_1, I_2, ..., I_{n-1} can be consistently instantiated as
A_1, A_2, ..., A_{n-1}. If for all 1 ≤ p, q, r ≤ (n − 1) the given instantiations A_p, A_q
and A_r can be extended consistently to an instantiation of I_n, then the
instantiation A_1, A_2, ..., A_{n-1} can be extended to an n-tuple of consistent
instantiations of all the n intervals.

Proof. The binary relations r_jn restrict the permissible domain of I_n for the given
instantiation A_j. Since the relations are convex, these admissible domains, R_jn, are
convex sets in ℝ². As a direct consequence of Helly's Theorem, if any three of these
convex regions have a non-empty intersection, then all the (n − 1) regions have a
common non-empty region. Hence, the given instantiation can be extended to an
n-tuple which is a consistent instantiation of all the n variables.



Theorem 5. A convex 4-consistent INDU-network is globally consistent and
minimal.

Proof. The theorem is proved by showing that if the network is strongly 4-consistent
and all of its constraints are convex, then the network is k-consistent for all
k ≤ n. Hence the network is strongly n-consistent and, therefore, minimal. This
follows directly from the above theorem.

Interestingly, Theorem 4 can be extended to pre-convex relations. The main
result of this work is to establish that a pre-convex INDU-network is globally
consistent if it is strongly 4-consistent. The proof is based on the following
smaller results.

Lemma 1. If, for two intervals I_1 and I_2 in an INDU-network, the relations
r_1k and r_2k are 1-dimensional and their corresponding linear regions R_1k
and R_2k are contained in the same line (they may not be the same line segment,
but they are portions of the same straight line), then A_1 and A_2 satisfy a
1-dimensional or 0-dimensional relation in r_12.

Proof (sketch). The proof is straightforward once we graphically interpret the
statement. In Figure 4, if for a given instantiation A_1 the region R_1n is a line
segment P and similarly, for the given A_2, the region R_2n is a line segment Q
such that both R_1n and R_2n are parts of the same straight line UV, then A_2
must lie on the vertical line UV or the horizontal line UW.



Corollary 1. If, for two intervals I_1 and I_2 in an INDU-network, the
relations r_1k and r_2k are 0-dimensional and their corresponding regions R_1k
and R_2k are points, then A_1 and A_2 satisfy r_12 such that A_1^d = A_2^d.



Theorem 6. For a path consistent convex INDU-network, such that for the
given consistent instantiation A_1 and A_2 the regions satisfy R_1n ∩ R_2n = P,
where P is a line segment, r_12 is a 2-dimensional non-atomic relation or both R_1n
and R_2n are lines.

Fig. 3. The admissible region of Y when X{bi^<, bi^>, oi^<, oi^=, oi^>}Y.

Fig. 4. A_2's admissible domain falls on a line (horizontal or vertical) when R_1n
and R_2n are on the same line.

Proof. For the given instantiation, A_1 and A_2, if R_1n ∩ R_2n = P, then P lies in
the boundary line where both the regions meet. This is possible only when there is
an atomic relation in r_1n (and in r_2n) which corresponds to a line segment collinear
with P. By Lemma 1, A_1 and A_2 instantiate a 1- or 0-dimensional relation.
Unless both R_1n and R_2n are lines, for a point (x, y) in P we can always select a
point B = (x', y') in a neighbourhood of (x, y) such that B ∈ R_1n but B ∉ R_2n. Thus,
A_1 and B are consistent instantiations of I_1 and I_n. But if I_2 is to be consistent
with these, then it has an instantiation A_2' satisfying all the direct relations with
I_1 and I_n. This can happen only when the relation r_12 is necessarily
two-dimensional.

Definition 7 (Maximal instantiation). The interval variables I_i and I_j are
said to be maximally instantiated to A_i and A_j if A_i and A_j satisfy the relation of
maximal dimension of the non-atomic relation r_ij between I_i and I_j.

Theorem 7. If a strongly 4-consistent INDU-network has only pre-convex
relations, then it is globally consistent.

Proof (sketch). Let us take any k intervals I_1, I_2, ..., I_k and let the intervals
I_1, I_2, ..., I_{k-1} be maximally and consistently instantiated as A_i, 1 ≤ i ≤ (k − 1).
The main line of argument is directed towards reaching a contradiction on maximality.
Let R_i, 1 ≤ i ≤ (k − 1), be the pre-convex admissible domains of I_k associated
with r_ik, 1 ≤ i ≤ (k − 1). 4-consistency ensures that for any three p, q, s such that
1 ≤ p < q < s ≤ (k − 1), we have R_p ∩ R_q ∩ R_s ≠ ∅. Assume that
⋂_{i=1}^{k-1} R_i = ∅. Let R̄_i denote the convex closure of R_i. Clearly,

R_p ∩ R_q ∩ R_s ≠ ∅  ∀ 1 ≤ p < q < s ≤ (k − 1)  ⇒
R̄_p ∩ R̄_q ∩ R̄_s ≠ ∅  ∀ 1 ≤ p < q < s ≤ (k − 1)  ⇒  ⋂_{i=1}^{k-1} R̄_i ≠ ∅.

Let the common region of R̄_i, 1 ≤ i ≤ (k − 1), be P. Since P is not in all of the R_i,
but is common to all R̄_i's, P is part of the additional region that is added to obtain
the convex closure for some pre-convex relation. Hence, P is of dimension 1 or 0. We
can consider the two cases separately, namely, (i) P is a line segment and (ii) P is a
point. If P is a line segment, then the maximal relations are necessarily 2-dimensional,
but these are not instantiated. Hence a contradiction. Similarly, if P is a point, the
maximal relations are 1-dimensional, but the given instantiation for which P is the
common region requires that these are 0-dimensional instantiations. Thus in either
case we reach a contradiction on maximality, which completes the proof. Hence, the
network is k-consistent.

Thus the consistency of a pre-convex INDU-network can be checked in polynomial
time.

6 Algebraic Properties 

We discussed in the previous section that if the network is 4-consistent then it is
globally consistent and also minimal. In order to have a reasoning system in
the present framework, it is necessary to have a method of deciding whether
a network is 4-consistent and, if not, to enforce 4-consistency by the process of
constraint tightening. To accomplish this task, we need certain algebraic
operations for INDU. An atomic relation in E can also be viewed as an ordered
pair (r, ℓ) where r ∈ I and ℓ ∈ D. Thus E ⊂ I × D, where D = {<, >, =}. Three
operations, namely, unary converse (denoted by ⌣), binary intersection
(denoted by ∩) and relational composition (denoted by ⊗), can be defined on
INDU as follows:

Definition 8 (Converse). For any atomic relation ξ = (r, ℓ), the converse
of ξ, denoted by ξ⌣, is given by ξ⌣ = (r⌣, ℓ⌣). Here r⌣ and ℓ⌣ have the usual
meaning of converse as in IA and PA.

For any non-atomic relation r = {ξ_1, ξ_2, ..., ξ_n} in INDU, the converse of r
is r⌣ = {ξ_1⌣, ξ_2⌣, ..., ξ_n⌣}. The intersection of any two relations r and s in
INDU can be expressed as the set-theoretic intersection of the sets of atomic
relations.

Definition 9 (Intersection). If r = {ξ_1, ξ_2, ..., ξ_n} and s = {β_1, β_2, ..., β_n},
then r ∩ s is the set intersection {ξ_1, ξ_2, ..., ξ_n} ∩ {β_1, β_2, ..., β_n}.

Definition 10 (Composition). For any two atomic relations ξ_1 = (r_1, ℓ_1) and
ξ_2 = (r_2, ℓ_2), the composition r = ξ_1 ⊗ ξ_2 is given by the set of ordered pairs
in the Cartesian product ((r_1 ⊗_I r_2) × (ℓ_1 ⊗_P ℓ_2)) ∩ E, where r_1 ⊗_I r_2 and
ℓ_1 ⊗_P ℓ_2 are the compositions in IA and in PA, respectively.
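
The converse and composition of atomic relations can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: atomic relations are encoded as (Allen relation, duration symbol) pairs, the set E is built from the observation that d, s, f force a shorter duration, di, si, fi a longer one and eq an equal one, and the IA composition is passed in as a function. The stand-in `toy_ia_compose` knows only the single entry b ⊗ b = {b} and is purely hypothetical; a full 13 × 13 Allen composition table would be needed in practice.

```python
from itertools import product

# Composition table of the point algebra over {<, =, >}.
PA_COMPOSE = {
    ("<", "<"): {"<"}, ("<", "="): {"<"}, ("<", ">"): {"<", "=", ">"},
    ("=", "<"): {"<"}, ("=", "="): {"="}, ("=", ">"): {">"},
    (">", "<"): {"<", "=", ">"}, (">", "="): {">"}, (">", ">"): {">"},
}
PA_CONVERSE = {"<": ">", "=": "=", ">": "<"}
IA_CONVERSE = {
    "b": "bi", "bi": "b", "m": "mi", "mi": "m", "o": "oi", "oi": "o",
    "d": "di", "di": "d", "s": "si", "si": "s", "f": "fi", "fi": "f",
    "eq": "eq",
}

# Duration symbols compatible with each Allen relation.
ALLOWED = {name: "<=>" for name in ("b", "bi", "m", "mi", "o", "oi")}
ALLOWED.update({name: "<" for name in ("d", "s", "f")})
ALLOWED.update({name: ">" for name in ("di", "si", "fi")})
ALLOWED["eq"] = "="
E = {(r, l) for r, symbols in ALLOWED.items() for l in symbols}


def converse(atom):
    """Converse of an atomic relation (r, l)."""
    r, l = atom
    return (IA_CONVERSE[r], PA_CONVERSE[l])


def compose(atom1, atom2, ia_compose):
    """Pairs of the IA and PA compositions, restricted to valid atoms in E."""
    (r1, l1), (r2, l2) = atom1, atom2
    return set(product(ia_compose(r1, r2), PA_COMPOSE[(l1, l2)])) & E


def toy_ia_compose(r1, r2):
    """Hypothetical stand-in for the Allen composition table (b, b only)."""
    if (r1, r2) == ("b", "b"):
        return {"b"}
    raise KeyError("entry not included in this toy table")


print(sorted(compose(("b", "<"), ("b", ">"), toy_ia_compose)))
```

Intersecting with E at the end is what removes pairs such as (d, >), which the Cartesian product alone would admit.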

Theorem 8. INDU is closed under converse, intersection and composition.

The proof is straightforward from the definitions. INDU is an algebra with these
operations.

7 Conclusion

In the original proposal of IA [1], Allen suggested that duration information can
be encoded in a network orthogonal to the interval relationship network. Allen
called this a duration reasoning system. There are other works, such as PDN [8]
and APDN [14]. The PDN is a bi-network framework, representing qualitative
information about points and durations. It can only handle pointisable (SIA)
relations and, to represent interval information, PDN translates it to equivalent
point information. This increases the complexity of the problem. APDN follows the
same structure but extends PDN to represent quantitative information as well.
Koubarakis [5] also attempts to represent relations between durations of intervals.
However, this work addresses a restricted class of temporal constraints. Recently,
a unified framework for quantitative and qualitative reasoning was proposed as
PIDN [11]. PIDN handles interval and duration information in a much more general
sense, but it does not possess the elegant computational properties of INDU.
INDU elegantly extends IA, and all existing efficient methods for IA can
easily be extended to INDU. In a forthcoming paper, the authors propose to
report on the algorithmic characteristics of INDU.

Abstract. The rules of inductive inference are formalized using transition
rules. The rejection of a consequence obtained by inductive inference
is formalized by a revision rule. An inductive process is defined as a
sequence of versions of a theory generated by alternately applying the
inductive inference rules and the revision rule. An inductive procedure
scheme is constructed. It takes a sequence E_M of instances of a given
model M and a given formal theory Γ as its inputs, and generates the
inductive processes. It is proved that if E_M contains all instances of the
model M, then every inductive sequence generated by the procedure
scheme is convergent. Its limit is the set of all true statements of the
model M.

Keywords: inductive reasoning, belief revision, knowledge representation
and reasoning, inductive process



1 Deduction and Induction 

It is well known that inductive inference is different from deductive inference. 
Generally speaking, deductive inference deals with how to get a particular conse- 
quence from a general statement while the concern of inductive inference is how 
to obtain regularities from a particular result or instance or obtain a sufficient 
condition from the necessary conditions. 

In this paper, a rigorous mathematical theory of inductive inference and its
rationality is set up. It is based on classical proof theory and model theory within
the scope of first order languages. The basic ideas of this paper in dealing with
inductive inference can be expressed as follows:

1. The sequences of versions of formal theories. One of the fundamental
differences between inductive inference and deductive inference is that inductive
inference promotes the evolution of a theory and involves the growth of
knowledge, while deductive inference is essentially the transformation of
tautologies, or, in other words, the conclusion deduced is implied in the premises,
hence no substantial growth of knowledge is incurred. Let ⊢ denote the deductive
relation, Γ denote a formal theory, and Th(Γ) denote the theoretical closure of Γ
under the rules of deductive inference (for example, Gentzen rules). Let ⇒ denote
the inductive relation (or evolutionary relation). This difference can be described
by the following forms. For deductive inference, we have

If Γ ⊢ A, then Th(Γ) = Th({A} ∪ Γ).

But in the case of inductive inference, the information is increasing. That is
to say:

If Γ ⇒ A, then Th(Γ) ⊂ Th({A} ∪ Γ).

A new formal theory is generated after a rule of inductive inference is applied.
To put it more precisely, a new version of the theory is generated after a rule of
inductive inference is applied [1]. If Γ_n is used to denote the n-th version of a
given formal theory Γ, then an inductive process can be naturally described by the
sequence of the versions of the theory according to the order of their appearance:

Γ_1, Γ_2, ..., Γ_n, ...

where Γ_i ⇒ Γ_{i+1}, i.e., Γ_{i+1} is obtained by applying a rule of inductive
inference to Γ_i.

N. Foo (Ed.): AI’99, LNAI 1747, pp. 304-315, 1999.

In the “Topica” of his famous “Organon”, Aristotle stressed: “induction is a
passage from individuals to universals” [2]. This statement has a dual implication.
One is that inductive inference proceeds from a particular fact to a universal
statement, and the other is that induction is a sequence which describes
the growth of knowledge from individuals to universals.

2. The orthogonality of deductivity and inductivity. If we use tree structures to
denote the theoretical closure of a theory under deduction, the relation
between inductive inference and deductive inference can be expressed by the
following diagram:




The above simple diagram describes a certain “orthogonality” of inductive
inference and deductive inference. It is the starting point for this paper
in setting up the theory of inductive inference. A similar diagram was used by
Plotkin [3] to describe the relation between program executions and their proofs
in structured operational semantics.

3. The refutations of inductive consequences. Another fundamental difference
between inductive inference and deductive inference lies in the fact that
deductive inference is sound, i.e., the consequence obtained by applying some rules
of deductive inference is always true provided the premises are true. But this is
not the case with inductive inference. The inductive consequence may not
necessarily be true; it can be wrong in many cases.

As a matter of fact, the truth or falsity of a consequence derived by inductive
inference can only be examined by the presence or absence of counter-examples
which run against the consequence. An inductive consequence is false if it is
refuted by some counter-examples. Therefore, in no sense can inductive inference
be independent of refutation. Inductive inference and refutation are opposite
and complementary to each other. Since the word “refutation” is used in many
publications to denote different things, in this paper we use “the rejection by
facts” to avoid confusion [4].

4. The context sensitive character of inductive inference. We use the following
form:

Γ ⇒ Γ'.    (1)

where Γ' is a new version of Γ after a certain rule of inductive inference is
applied. In order to make the inductive inference meaningful, consistency
between the inductive consequence and its premise is required. Thus, Γ' should
be consistent with Γ. This means that the rules of inductive inference are formed
by a context-sensitive grammar.

5. The rationality of inductive inference. The paradigm of inductive inference
given above is the following: Suppose that the world which we study is described
by a model M, that E_M is a set of instances of M and that Γ_0 is the initial version
of a theory. The goal is to induce a theory from E_M using inductive inference,
and to guarantee that this theory takes all true statements of the model M as
its logical consequences.

We reach this goal by the following approach. We begin with Γ_0 and repeat
the following operations: each time, we pick one instance A from E_M and then
check the current version Γ_n using A, and generate a new version Γ_{n+1} either by
applying some inductive inference rule if A is a positive instance or by applying
the revision rule if A denotes a rejection by facts of Γ_n. By repeatedly performing
these operations, a sequence of versions of the theory is produced. Since, at each
particular time, the new version generated by inductive inference may be rejected
by some later instance, it is hard to speak of rationality at that particular time.
Rationality is meaningful only in the case that the sequence of the versions
of the theory is convergent, and its limit is the set of all true statements of the
model M.

2 The Rules of Inductive Inference 

A first order language L is assumed, which is formally defined, for example,
in [7]. CN, FN and PN are used to denote the sets of constants, functions
and predicates of L respectively, and CN is a countably infinite set. A, Γ and
Th(Γ) are used to denote a formula, a set of formulas and the theoretical closure
of Γ under the deduction (for example, Gentzen style) rules. CN(Γ), FN(Γ)
and PN(Γ) are the sets of constants, functions and predicates of Γ. Γ ⊢ A
is called a sequent and denotes that A is deduced from Γ according to Gentzen
style deduction rules [11]. Here, Γ is a sequence; similarly, A(t), ∀x.A(x), Γ and
A, A ⊃ B, B, Γ are formal theories and are sequences. They are also taken as
sets of formulas when required [11].

A model M is a pair <M, I>, where M is a domain and I is an interpretation.
Sometimes, M_p is used to denote a model for a particular problem p. T_{M_p}
denotes the set of all true sentences of M_p and is a countable set. To simplify
the proofs, it is assumed that T_{M_p} has built-in Skolem functions [8]. Γ ⊨ A is
used to mean that A is a semantic (or logical) consequence of Γ [7]. The validity,
satisfiability and falsifiability of deduction in first order logic are defined as
in [7]. Therefore the deduction rules are sound and complete.

To formalize the rules of inductive inference, we need the Herbrand universe
defined below [7]. Let

H_0 = {a_0, ..., a_n, ... | a_n ∈ CN}

H_{i+1} = H_i ∪ {f(t_1, ..., t_n) | t_j ∈ H_i, f ∈ FN}

H = ⋃_{i=0}^{∞} H_i

H is called a Herbrand universe of L. The transition:

cond(Γ, Γ')
-----------
  Γ ⇒ Γ'

is used to represent the rules of inductive inference, where Γ and Γ' are formal
theories and cond(Γ, Γ') is a condition that Γ and Γ' must satisfy. The rule can
be informally interpreted as: if cond(Γ, Γ') holds, then Γ becomes Γ' after
a rule of inductive inference is applied. The inductive inference rules are given
below:

Definition 1. Inductive inference rules

Inductive Extension

P ∉ PN(Γ)    t ∈ H
------------------
  Γ ⇒ P(t), Γ

P ∉ PN(Γ)    t ∈ H
------------------
  Γ ⇒ ¬P(t), Γ

Inductive Generalization

¬A(t') ∉ Th(Γ)    t' ∈ H
----------------------------
A(t), Γ ⇒ A(t), ∀x.A(x), Γ

Inductive Sufficient Condition

¬A ∉ Th({B} ∪ Γ)
------------------------------
A ⊃ B, B, Γ ⇒ A, A ⊃ B, B, Γ

Informally, the inductive extension rule means that a new predicate can be
added to a theory; the inductive generalization rule says that a particular
instance can be generalized to a universal statement; the inductive sufficient
condition rule allows one to obtain a sufficient condition from a necessary
condition.

Lemma 1. If Γ' is obtained from Γ by applying an inductive inference rule
given in Definition 1, then Γ' is consistent with Γ.

Proof: Straightforward from the definition of the rules. □

The lemma shows that the rules of inductive inference are in the category
of context sensitive grammars. Γ_n and Γ' are called versions of the theory Γ. A
version of a theory is usually viewed as a consistent set of non-logical axioms of
a particular problem.

3 Rejection by Facts and Revision Rule

Definition 2. Rejection by facts

Let Γ ⊢ A. The model M is called a rejection by facts of A, if M ⊨ ¬A
holds.

Let Γ_{M(A)} = {A_i | A_i ∈ Γ, M ⊨ A_i, M ⊨ ¬A}.

M is called an ideal rejection by facts of A iff Γ_{M(A)} is maximal in the
sense that there does not exist another rejection by facts of A, M', such that
Γ_{M(A)} ⊂ Γ_{M'(A)}. An ideal rejection by facts of A is denoted by M(A). Let

ℛ(Γ, A) = {Γ_{M(A)} | M is an ideal rejection by facts of A}

The ideal rejection by facts satisfies Occam's razor, which says: “Entities
are not to be multiplied beyond necessity” [4,6].

Definition 3. Ideal revision

Let Γ ⊢ A. A theory Λ is called an ideal revision of Γ by ¬A, if it is a
maximal subset of Γ that is consistent with ¬A.

Let 𝒜(Γ, A) be the set of all ideal revisions of Γ by ¬A.

Theorem 1. 𝒜(Γ, A) = ℛ(Γ, A).

Proof: Straightforward. □

Example 1. Let Γ = {A, A ⊃ B, B ⊃ C, E ⊃ F}. Obviously, Γ ⊢ C holds.
According to the definition, 𝒜(Γ, C) consists of the following three maximal
subsets of Γ:

{A, A ⊃ B, E ⊃ F}, {A, B ⊃ C, E ⊃ F}, {A ⊃ B, B ⊃ C, E ⊃ F}.
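
The three ideal revisions in Example 1 can be checked by brute force in the propositional case. The Python sketch below is only an illustration under stated assumptions: formulas are encoded as functions over truth assignments, consistency is tested by enumerating all assignments, and the names `GAMMA`, `NOT_C`, `consistent` and `ideal_revisions` are hypothetical. It does not attempt the first order setting of the paper.

```python
from itertools import combinations, product

ATOMS = ["A", "B", "C", "E", "F"]


def implies(p, q):
    """Material implication p ⊃ q as a function of a truth assignment."""
    return lambda v: (not v[p]) or v[q]


# The theory of Example 1, keyed by a readable label for each formula.
GAMMA = {
    "A": lambda v: v["A"],
    "A>B": implies("A", "B"),
    "B>C": implies("B", "C"),
    "E>F": implies("E", "F"),
}
NOT_C = lambda v: not v["C"]


def consistent(formulas):
    """True if some truth assignment satisfies every formula."""
    for values in product([False, True], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if all(f(v) for f in formulas):
            return True
    return False


def ideal_revisions(theory, negated):
    """Maximal subsets of the theory that are consistent with `negated`."""
    names = sorted(theory)
    ok = [
        set(subset)
        for k in range(len(names) + 1)
        for subset in combinations(names, k)
        if consistent([theory[n] for n in subset] + [negated])
    ]
    return [s for s in ok if not any(s < t for t in ok)]


for revision in ideal_revisions(GAMMA, NOT_C):
    print(sorted(revision))
```

The enumeration is exponential in the size of the theory, so it is suitable only for small examples like this one.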

A revision rule can be described as follows:

Revision rule

A ∈ Th(Γ)    Λ ∈ 𝒜(Γ, A)
-------------------------
        Γ ⇒ Λ

The rule means that when a consequence deduced from the theory Γ meets a
rejection by facts ¬A, a new version of Γ is an ideal revision of Γ by ¬A. The
concept of reconstruction tells how a new version of a theory can be constructed:

Definition 4. Reconstruction

Given a theory Γ and a sentence A, a theory Γ' is called a reconstruction
of Γ for A, if Γ' is defined as below:

If Γ ⊨ A, then Γ' is Γ itself; if Γ ⊨ ¬A (¬A meets a rejection by facts),
then Γ' is Λ, Λ ∈ ℛ(Γ, A); otherwise, Γ' is obtained from Γ by applying a rule
of inductive inference to Γ and A. For the sake of simplicity, they are called
the E-reconstruction, R-reconstruction and I-reconstruction of Γ respectively.

If Λ is an R-reconstruction of Γ for A, then Λ is a maximal subset of Γ and
is consistent with ¬A. Therefore, R-reconstruction is not unique. A similar idea
to R-reconstruction is given in [5], where it is called maxichoice contraction.
The minor difference is that maxichoice contraction is defined for the
theoretical closure Th(Γ) in propositional logic, and is focused on its proof-theoretic
aspects, called postulates. In this paper, we are interested in its model-theoretic
aspects and the interaction between a theory and its environments. We will focus
our attention on the convergent sequences of versions of a theory [10].

4 Sequences and Limits

Definition 5. Sequence

A sequence Γ_1, Γ_2, ..., Γ_n, ... is called a sequence of theories, or sequence for
short, if Γ_n is a theory for all n ≥ 1. It is increasing if Γ_n ⊆ Γ_{n+1} for all n; it is
decreasing if Γ_n ⊇ Γ_{n+1} for all n; otherwise, it is non-monotonic.

It is assumed that P and Q are the same sentence if P ≡ Q, i.e., (P ⊃
Q) ∧ (Q ⊃ P) is a tautology.

Definition 6. Limit of a sequence

Let {Γ_n} be a sequence. The set of sentences:

Γ^* = ⋂_{n=1}^{∞} ⋃_{m=n}^{∞} Γ_m

is called the upper limit of {Γ_n}. The set of sentences:

Γ_* = ⋃_{n=1}^{∞} ⋂_{m=n}^{∞} Γ_m

is called the lower limit of {Γ_n}.

A sequence is convergent if and only if Γ_* = Γ^*. Its limit is the lower (upper)
limit of the sequence and is denoted as lim_{n→∞} Γ_n.

The following lemma tells the meaning of the lower and upper limits.

Lemma 2. 1. A ∈ Γ_* if and only if there exists N > 0 such that A ∈ Γ_m holds
for m > N.

2. A ∈ Γ^* if and only if there exists a sub-sequence {k_n} such that A ∈ Γ_{k_n}
holds for 1 ≤ n < ∞.

Proof. Straightforward from the definition. □

Lemma 3. Every increasing sequence {Γ_n} is convergent. Its limit is ⋃_{n=1}^{∞} Γ_n.
Every decreasing sequence {Γ_n} is also convergent. Its limit is ⋂_{n=1}^{∞} Γ_n.



Example 2. Increasing sequences 

Increasing sequences play an important role in the proofs of some theorems of mathematical logic. For example, the Lindenbaum theorem says: “Any formal theory of the language L can be extended to a maximal theory.” Its proof is as follows: since the set of sentences of L is countable, it can be given in the order $A_1, A_2, \ldots, A_n, \ldots$. Let $\Gamma_0 = \Gamma$ and

$$\Gamma_{n+1} = \begin{cases} \Gamma_n \cup \{A_n\} & \text{if } \Gamma_n \text{ and } A_n \text{ are consistent} \\ \Gamma_n & \text{if } \Gamma_n \text{ and } A_n \text{ are not consistent} \end{cases}$$

Here, $\{\Gamma_n\}$ is an increasing sequence, and its limit $\bigcup_{n=0}^{\infty} \Gamma_n$ is a maximal theory.



Example 3. Divergent sequence

$$\Gamma_n = \begin{cases} \{A\} & n = 2k+1 \\ \{\neg A\} & n = 2k \end{cases}$$

We have $\Gamma^{*} = \{A, \neg A\}$ and $\Gamma_{*} = \emptyset$. $\{\Gamma_n\}$ is divergent.
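The upper and lower limits of Definition 6 can be computed directly when a sequence is purely periodic, because every tail of such a sequence contains exactly the theories of one period. The following is a minimal sketch under that assumption; the function names and the string encoding of sentences (`"A"`, `"~A"`) are illustrative and not part of the paper.

```python
def limits_of_periodic(cycle):
    """Return (lower_limit, upper_limit) of the sequence that repeats `cycle` forever.

    For a purely periodic sequence, the union over any tail is the union of
    one period and the intersection over any tail is the intersection of one
    period, so the nested limits of Definition 6 collapse to these two sets.
    """
    if not cycle:
        raise ValueError("cycle must contain at least one theory")
    lower = frozenset.intersection(*cycle)
    upper = frozenset().union(*cycle)
    return lower, upper


def is_convergent(cycle):
    """A sequence converges iff its lower and upper limits coincide."""
    lower, upper = limits_of_periodic(cycle)
    return lower == upper


# Example 3: the theory alternates between {A} and {not A}.
divergent = [frozenset({"A"}), frozenset({"~A"})]
lower, upper = limits_of_periodic(divergent)
```

Here `lower` is the empty set and `upper` contains both `"A"` and `"~A"`, which agrees with the values stated in Example 3; a constant sequence, by contrast, has equal limits and is convergent.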



Example 4. Random sequence

Let $A$ be: “Tossing a coin, and getting a tail” and let $\Gamma_n$ denote the result of the $n$-th toss. $\{\Gamma_n\}$ is a random sequence of $A$ and $\neg A$, and is divergent.

5 Complete Inductive Processes and Their Limits

Definition 7. Inductive processes 

The sequence $\Gamma_1, \Gamma_2, \ldots, \Gamma_n, \ldots$ is called an inductive process or inductive sequence if $\Gamma_{i+1}$ is a reconstruction of $\Gamma_i$, where $i \geq 1$.



Theorem 2. 1. An inductive process $\{\Gamma_n\}$ is strictly increasing if and only if for every $n \geq 1$, $\Gamma_{n+1}$ is an I-reconstruction of $\Gamma_n$ for some $A$.

2. An inductive process $\{\Gamma_n\}$ is strictly decreasing if and only if for every $n \geq 1$, $\Gamma_{n+1}$ is an R-reconstruction of $\Gamma_n$ for some $A$.

3. An inductive process $\{\Gamma_n\}$ is non-monotonic if and only if I-reconstructions and R-reconstructions appear alternately.

Proof. Straightforward from the definition. □

It is assumed that one is always able to decide whether an instance is true 
or false. Thus, the instances can be described using the Herbrand universe [7] 
discussed in section 2: 

Definition 8. Complete instance sequence of a model M

Let

$\mathcal{P}_0 = \{P_0, \ldots, P_n, \ldots \mid P_n \text{ is a 0-ary predicate or its negation}\}$

$\mathcal{P}_{i+1} = \mathcal{P}_i \cup \{P^n(t_1, \ldots, t_n) \mid t_j \in H_i,\ P^n \text{ is an } n\text{-ary predicate or its negation}\}$

$\mathcal{P} = \bigcup_{i=1}^{\infty} \mathcal{P}_i$

An enumeration of the set $\mathcal{P}$ is called a Herbrand sequence of L.

Let $M_P$ be a given model of a particular problem $P$. The interpretation of the Herbrand universe of L under $M_P$ is called the Herbrand universe of $M_P$, and is denoted by $H_{M_P}$. The interpretation of a Herbrand sequence of L under $M_P$ is called a complete instance sequence of $M_P$; it is denoted by $\mathcal{E}_M$, or sometimes by $\{A_m\}$ if no confusion can occur. A sub-sequence of a complete instance sequence of $M$ is called a sample sequence of $M$.

For a given model M, the interpretation of every complete instance sequence is determined. Conversely, once any complete instance sequence is determined, the model is determined. The following lemma will be used later.

Lemma 4. $(Th(\Gamma))_{*} = Th((Th(\Gamma))_{*})$

Proof: Straightforward from the definition. 

Suppose that we know $\mathcal{E}_M = \{A_m\}$, which is a complete instance sequence of a given model $M$, and that it is the only thing we know about $M$. Suppose that $\Gamma$ is an initial version of a theory. Our goal is to obtain all true sentences of $M$ by using an inductive procedure which takes $\mathcal{E}_M$ and $\Gamma$ as its inputs. The basic idea of this procedure is the following:

Let $\Gamma_1 = \Gamma$. $\Gamma_{n+1}$ will be defined as follows:

1. If $\Gamma_n \vdash A_i$ for some $i$, then $\Gamma_{n+1} = \Gamma_n$;

2. If $\Gamma_n \vdash \neg A_i$, $\Gamma_{n+1}$ is an R-reconstruction of $\Gamma_n$ by $\neg A_i$. In this case, $\neg A_i$ has met a rejection by facts $A_i$, and $A_i$ must be accepted.

3. If neither 1 nor 2 can be done, then $\Gamma_{n+1}$ is defined by the induction rules given in section 2 as below:

(a) If $A_i = B(t)$ and the inductive generalization rule can be applied, then $\Gamma_{n+1}$ is $\{A_i, \forall x.B(x)\} \cup \Gamma_n$;

(b) Otherwise, if $A_i = B$ and there exists $A$ such that $\Gamma_n \vdash A \supset B$, and the inductive sufficient condition rule can be applied for $A$, then $\Gamma_{n+1}$ is $\{A, A \supset B, B\} \cup \Gamma_n$;

(c) If neither case (a) nor case (b) can be done, then $\Gamma_{n+1}$ is $\{A_i\} \cup \Gamma_n$. □

In the above procedure scheme, for every case, $\Gamma_{n+1}$ (or $\Gamma_{n+2}$) contains $A_i$. This is because $A_i \in \mathcal{E}_M \subseteq \mathcal{T}_M$, and it is necessary for the inductive sequence $\{\Gamma_n\}$ to converge to $\mathcal{T}_M$. The following example shows its necessity: assume that $\mathcal{E}_M = \{A(c_1), \neg A(c_2)\}$ is a complete instance sequence of a given model $M$. Let $\Gamma = \emptyset$. If we do not require that $\Gamma_{n+1}$ contain $A_i$, then according to the above procedure scheme, the generalization rule is applied to $A(c_1)$, and $\Gamma_1$ becomes $\{\forall x.A(x)\}$. But then $\neg A(c_2)$ becomes a rejection by facts of $\Gamma_1$. Since the R-reconstruction of $\Gamma_1$ by $\neg A(c_2)$ is the empty set, $A(c_1)$ is lost.

When an R-reconstruction is taken in response to a rejection by facts in the $n$-th stage of an inductive process, there are two things which we should deal with:

1. Any R-reconstruction selected should contain all instances accepted in the first $(n-1)$ stages. To do so, we introduce a sequence $\Delta$ to store these accepted instances.

2. If an R-reconstruction is not selected properly, some information may still be lost. For example, consider $\Gamma = \{A \wedge B\}$; both $\Gamma \vdash A$ and $\Gamma \vdash B$ are provable. Assume that $A$ has met a rejection by facts; then the maximal subset of $\Gamma$ which is consistent with $\neg A$ is the empty set. Thus, the R-reconstruction (of $\Gamma$ for $A$) is the empty set. $B$ does not meet a rejection, but is missing after the revision rule is applied to $A$. In order to avoid this, we introduce a sequence $\Theta$ to collect those instances $A_m$ which are taken to be logical consequences in case 1. After each R-reconstruction is taken, the procedure checks all $A_m$ contained in $\Theta$, and restores the lost ones. Since, at any stage of the evolution, $\Theta$ is always finite, the checking will terminate.

The above procedure scheme can be specified by a Pascal-like procedure. In the procedure, * is used to denote the concatenation or union of two sequences. The procedure can also be specified by using some transition systems, which can be found in [10].

Definition 9. An inductive procedure 

Let $M$ be a given model and $\mathcal{E}_M = \{A_i\}$ be a complete instance sequence of $M$, and let $\Gamma$ be a theory. Let $\Gamma_1 = \Gamma$; $\Gamma_{n+1}$ is defined by the procedure declaration GUINA.

A sequence $\{\Gamma_n\}$ generated by the procedure GUINA is called a complete inductive process (or sequence) of $\mathcal{E}_M$ and $\Gamma$, if the complete instance sequence $\mathcal{E}_M$ of the model $M$ and the theory $\Gamma$ are taken to be the actual parameters of the procedure GUINA.

In this procedure declaration, a composition of inductive inference rules is used in some cases. For example, in the third case (the second else if), when the conditions $head(T_n) = P(t)$ and $\Gamma_n \nvdash \exists x.\neg P$ are satisfied, the procedure executes as follows:

1. If $P \notin PN(\Gamma_n)$, then the inductive extension rule is applied first, and $\Gamma_n \Longrightarrow P(t), \Gamma_n$ is obtained; then, since $\Gamma_n \nvdash \exists x.\neg P$, the inductive generalization rule is further applied to $P(t)$ and $\Gamma_n$, that is, $P(t), \Gamma_n \Longrightarrow P(t), \forall x.P(x), \Gamma_n$.
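procedure GUINA($T$: F-seq; var $\Gamma$: F-seq);
    var $\Theta = \emptyset$: F-seq;    % F-seq denotes a sequence of sentences
    var $\Delta = \emptyset$: F-seq;
begin
    loop do
    begin
        if $\Gamma_n \vdash head(T_n)$
        then begin
            $\Gamma_{n+1} := \Gamma_n$;  $T_{n+1} := tail(T_n)$;
            $\Theta_{n+1} := \Theta_n * \{head(T_n)\}$;  $\Delta_{n+1} := \Delta_n$
        end
        else if $\Gamma_n \vdash \neg head(T_n)$
        then begin
            $\Gamma_{n+1} := \Lambda$, where $\Lambda \in \mathcal{A}(\Gamma_n, \neg head(T_n))$ and $\Delta_n \subseteq \Lambda$;
            $T_{n+1} := head(T_n) * \Theta_n * tail(T_n)$;
            $\Theta_{n+1} := \Theta_n$;  $\Delta_{n+1} := \Delta_n$
        end
        else if $head(T_n) = P(t)$ and $\Gamma_n \nvdash \exists x.\neg P$
        then begin
            $\Gamma_{n+1} := \{P(t), \forall x.P(x)\} * \Gamma_n$;
            $T_{n+1} := tail(T_n)$;
            $\Theta_{n+1} := \Theta_n$;  $\Delta_{n+1} := \{P(t)\} * \Delta_n$
        end
        else if $head(T_n) = B$ and there is $A$ such that
                $\Gamma_n \vdash A \supset B$ and $\Gamma_n \nvdash \neg A$
        then begin
            $\Gamma_{n+1} := \{A, B\} * \Gamma_n$;  $T_{n+1} := tail(T_n)$;
            $\Theta_{n+1} := \Theta_n * \{B\}$;  $\Delta_{n+1} := \Delta_n$
        end
        else begin
            $\Gamma_{n+1} := \{head(T_n)\} * \Gamma_n$;
            $T_{n+1} := tail(T_n)$;
            $\Theta_{n+1} := \Theta_n$;  $\Delta_{n+1} := \Delta_n * \{head(T_n)\}$
        end
    end
    end loop
end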



2. If $P \in PN(\Gamma_n)$, since $\Gamma_n \nvdash \exists x.\neg P$, we have

$$\Gamma_n \Longrightarrow P(t), \Gamma_n,$$

and then the inductive generalization rule is applied to $P(t), \Gamma_n$, and $P(t), \Gamma_n \Longrightarrow P(t), \forall x.P(x), \Gamma_n$ is obtained. In the second case, the rule

$$\frac{P \in PN(\Gamma) \qquad \Gamma \text{ and } P(t) \text{ are consistent}}{\Gamma \Longrightarrow P(t), \Gamma}$$

is used. We did not define this rule as a basic inductive rule since it seems trivial. It can be derived from the sufficient condition rule. Let us see how the procedure works through the following example:

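Example 5. Let $H = \{a, b\}$ and $\mathcal{P} = \{P(a), P(b), Q(a), Q(b)\}$; under the model $M$ the interpretation of $\mathcal{P}$ is $\mathcal{E}_M = \{P(a), \neg P(b), Q(a)\}$. The procedure scheme 5.3 can inductively produce $\forall x P(x) \supset Q(x)$.

1. The first step: $\Gamma_1 = \{\}$, $\mathcal{E}_M = \{P(a), \neg P(b), Q(a)\}$, $T_1 = \mathcal{E}_M$, $\Theta_1 = \Delta_1 = \emptyset$.

2. The inductive generalization: using the rule of inductive generalization,

$$\frac{\exists x \neg P(x) \notin Th(\Gamma_1)}{P(a), \Gamma_1 \Longrightarrow P(a), \forall x P(x), \Gamma_1},$$

we obtain $\Gamma_2 = \{P(a), \forall x P(x)\}$, $T_2 = \{\neg P(b), Q(a)\}$, $\Theta_2 = \emptyset$, $\Delta_2 = \{P(a)\}$.

3. Using the revision rule:

$$\frac{P(b) \in Th(\Gamma_2) \qquad \{P(a)\} \in \mathcal{A}(\Gamma_2, P(b))}{\Gamma_2 \Longrightarrow \{P(a)\}}$$

we obtain $\Gamma_3 = \{P(a)\}$, $T_3 = \{\neg P(b), Q(a)\}$, $\Theta_3 = \emptyset$, $\Delta_3 = \{P(a)\}$.

4. Using the inductive extension rule, we obtain $\Gamma_4 = \Gamma_3 \cup \{\neg P(b)\}$, thus $\Gamma_4 = \{P(a), \neg P(b)\}$, $T_4 = \{Q(a)\}$, $\Theta_4 = \emptyset$, $\Delta_4 = \{\neg P(b), P(a)\}$.

5. Using the inductive generalization again,

$$\frac{\exists x \neg Q(x) \notin Th(\Gamma_4)}{Q(a), \Gamma_4 \Longrightarrow Q(a), \forall x Q(x), \Gamma_4},$$

we obtain $\Gamma_5 = \{P(a), \neg P(b), Q(a), \forall x Q(x)\}$, $T_5 = \emptyset$, $\Theta_5 = \emptyset$, $\Delta_5 = \{Q(a), \neg P(b), P(a)\}$.

When $n \geq 5$, $\Gamma_n = \Gamma_5 = \{P(a), \neg P(b), Q(a), \forall x Q(x)\}$.

6. Deductive inference: since $\Gamma_n \vdash \neg P(b)$, by the $\exists$ rule we have $\Gamma_n \vdash \exists x \neg P(x)$. Using the $\vee$ rule we obtain $\Gamma_n \vdash \exists x \neg P(x) \vee Q(x)$. But $\exists x \neg P(x) \vee Q(x) \equiv \neg \forall x P(x) \vee Q(x) \equiv \forall x P(x) \supset Q(x)$. Therefore, when $n \geq 5$, $\forall x P(x) \supset Q(x) \in Th(\Gamma_n)$ can be derived, that is,

$$\forall x P(x) \supset Q(x) \in (Th(\Gamma_n))_{*}.$$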

□ 
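The trace of Example 5 can be replayed with a small program. The following is only a sketch of the procedure on a toy fragment, under several assumptions that are not in the paper: sentences are restricted to ground literals, encoded as `(sign, predicate, constant)`, and universal generalizations, encoded as `("forall", predicate)`; entailment is decided by lookup rather than by a theorem prover; the sufficient condition rule (the fourth case of GUINA) and the bookkeeping sequences $\Theta$ and $\Delta$ are omitted; and the instance sequence is assumed to be consistent.

```python
def neg(lit):
    """Negate a ground literal (sign, predicate, constant)."""
    sign, pred, const = lit
    return (not sign, pred, const)


def entails(theory, lit):
    """Toy entailment: the literal is present, or it is positive and generalized."""
    sign, pred, _ = lit
    return lit in theory or (sign and ("forall", pred) in theory)


def refutes_generalization(theory, pred):
    """True if the theory contains a negative instance of pred (proves 'exists x. not pred')."""
    return any(s[0] is False and s[1] == pred for s in theory)


def r_reconstruction(theory, rejected):
    """Largest subset of the theory that no longer entails the rejected literal."""
    sign, pred, _ = rejected
    dropped = {rejected}
    if sign:  # a positive literal may also follow from the generalization
        dropped.add(("forall", pred))
    return theory - dropped


def guina(instances, theory=frozenset()):
    """Return the list of successive theory versions Gamma_1, Gamma_2, ..."""
    gamma = set(theory)
    pending = list(instances)
    history = [frozenset(gamma)]
    while pending:
        head = pending[0]
        sign, pred, _ = head
        if entails(gamma, head):                      # evidence for the theory
            pending.pop(0)
        elif entails(gamma, neg(head)):               # rejection by facts
            gamma = r_reconstruction(gamma, neg(head))
            # head stays pending and is accepted at the next step
        elif sign and not refutes_generalization(gamma, pred):
            gamma |= {head, ("forall", pred)}         # inductive generalization
            pending.pop(0)
        else:
            gamma.add(head)                           # inductive extension
            pending.pop(0)
        history.append(frozenset(gamma))
    return history


example5 = [(True, "P", "a"), (False, "P", "b"), (True, "Q", "a")]
history = guina(example5)
```

On this input `history` holds five versions, and the last one contains $P(a)$, $\neg P(b)$, $Q(a)$ and the generalization of $Q$, matching $\Gamma_5$ in Example 5; the intermediate versions match $\Gamma_2$ through $\Gamma_4$.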



For the inductive procedure scheme 5.3, the following theorem holds:

Theorem 3. Let $M_p$ be a model for a specific problem $p$, $\mathcal{T}_{M_p}$ be the set of all its true sentences, $\mathcal{E}_{M_p}$ be its complete instance sequence, and $\Gamma$ be a theory. If $\{\Gamma_n\}$ is an inductive process of $\mathcal{E}_{M_p}$ and $\Gamma$, and is generated by the inductive procedure 5.3, then $\{\Gamma_n\}$ is convergent, and

$$\lim_{n \to \infty} Th(\Gamma_n) = \mathcal{T}_{M_p}.$$

Proof. By structured induction (see Appendix). □

The meaning of theorem 5.2 can be understood as follows. For a particular problem $M_p$, one does not know all its laws at the outset of studying the problem. As a matter of fact, one's knowledge of the problem is constantly improved and deepened throughout the inductive process. An instance sequence $\mathcal{E}_{M_p}$ serves as the criterion against which the inductive inference is tested. One instance of $\mathcal{E}_{M_p}$ is examined each time: if it is a logical consequence of the current version of a theory, then it is taken to be evidence for the theory; if it is a rejection by facts of the current version, then the new version of the theory is an ideal revision of the current version; otherwise, the new version is induced from the instance in accordance with the inductive inference rules given in section 2. If $\mathcal{E}_{M_p}$ is a complete instance sequence of $M_p$, then the sequence of the versions of the theory is convergent, and its limit is $\mathcal{T}_{M_p}$. In fact, this can be seen as the rationality of the inductive inference rules.

Finally, the inductive procedure 5.3 is just a simple strategy to generate convergent inductive processes. In fact, one can construct many procedure schemes of this kind. It is a major objective of research in “machine learning” to look for efficient inductive algorithms which generate fast convergent inductive processes.
Abstract. Many researchers have proposed argumentation-based reasoning as a viable alternative to reasoning systems with a flat epistemological structure. Perhaps one of the longest standing approaches has been in the Oscar project, led by John Pollock. Unfortunately, without a formal semantics, it is often difficult to evaluate the various incarnations of defeasible reasoning. We provide a semantics for Pollock's defeasible reasoning in terms of Bondarenko et al.'s unified framework for default reasoning. We also indicate some internal inconsistencies between the motivation behind and definitions governing Pollock's system.



1 Introduction 

The OSCAR system is the implementation of a combination of many theories of 
rationality, and is probably most completely described in [7]. One major com- 
ponent, which is of particular interest to those researching commonsense and 
non-monotonic reasoning, is based around defeasible or argumentation based 
reasoning. It is this component with which we will be solely concerned. Pollock 
argues for this type of structure sensitive approach as a basis for epistemological 
theories, which he contrasts with the more traditional approach of logic based 
theories. In particular [6] he argues that his formalism captures aspects of rea- 
soning which have resisted logic based approaches, using as examples situations 
in which default logic [9] and circumscription [5] fail to deliver answers which he 
considers appropriate. However, as it is presented in [6], the rules which define 
defeasible reasoning in Oscar fail to capture the intuitions which were employed 
to motivate them. We present two examples (self-defeat and cycles of defeat - 
both of which were presented in [6]) that are handled in a manner contrary to 
our expectations. 

No formal semantics has been defined for Pollock’s system thus far. This 
causes two problems: (i) it becomes difficult to carry out a systematic study 
and/or extend Pollock’s system (e.g. to specialise it to fit into some specific 
formalism or a special domain); and more importantly, (ii) it is not always clear 
whether this system will give sound conclusions. A well-defined semantics is thus 
desirable, especially in light of the examples presented herein. 

* An extended version of this paper with full proofs of all theorems is also available. 

The reader should feel free to contact the authors for a copy. 

N. Foo (Ed.); AI’99, LNAI 1747, pp. 316-327, 1999. 

© Springer-Verlag Berlin Heidelberg 1999
1.1 Outline

First, we present Bondarenko et al.'s system, then re-express the argument struc- 
ture proposed by Pollock [6] in terms of algebraic expressions so that it becomes 
comparable to other frameworks in default reasoning, in particular, Bondarenko
et al.'s.

We also rectify some limitations in Pollock’s representation of argument 
structures to guarantee the integrity of the representation and note some in- 
ternal inconsistencies in it as presented in [6]. Finally, we provide a semantics 
for Pollock’s system in terms of Bondarenko et al.'s. 

2 Bondarenko et al.'s System

The original approach as presented by Dung in [4] is an abstract and general 
framework for argumentation which is independent of any system defined on top 
of this formalism. It defines an argumentation framework as a pair $\langle AR, attacks \rangle$
consisting of a set $AR$ of arguments and the attack relationship between arguments,
i.e., $attacks \subseteq AR \times AR$.

Further refinement led by Bondarenko et al. results in a unified framework 
for default reasoning [1]. A close examination of Pollock’s and Bondarenko et 
al.'s approaches shows some intimate connections between the two formalisms. 
We reproduce the relevant definitions from Bondarenko’s work for completeness. 
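A deductive system is a pair $(\mathcal{L}, \mathcal{R})$ where

- $\mathcal{L}$ is a formal language consisting of countably many sentences, and
- $\mathcal{R}$ is a set of inference rules of the form

$$\frac{\alpha_1, \ldots, \alpha_n}{\alpha}$$

where $\alpha, \alpha_1, \ldots, \alpha_n \in \mathcal{L}$ and $n \geq 0$.

- Any set of sentences $T \subseteq \mathcal{L}$ is called a theory. A deduction from a theory $T$ is a sequence $\beta_1, \ldots, \beta_m$, where $m > 0$, such that, for all $i = 1, \ldots, m$,

    • $\beta_i \in T$, or

    • there exists $\frac{\alpha_1, \ldots, \alpha_n}{\beta_i}$ in $\mathcal{R}$ such that $\alpha_1, \ldots, \alpha_n \in \{\beta_1, \ldots, \beta_{i-1}\}$.

- $T \vdash \alpha$ means that there is a deduction from $T$ whose last element is $\alpha$. $Th(T)$ is the set $\{\alpha \in \mathcal{L} \mid T \vdash \alpha\}$.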

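Definition 1 [1] Given a deductive system $(\mathcal{L}, \mathcal{R})$, an assumption-based framework with respect to $(\mathcal{L}, \mathcal{R})$ is a tuple $\langle T, Ab, \overline{\phantom{\alpha}} \rangle$ where
- $T, Ab \subseteq \mathcal{L}$ and $Ab \neq \emptyset$,
- $\overline{\phantom{\alpha}}$ is a mapping from $Ab$ into $\mathcal{L}$, where $\overline{\alpha}$ denotes the contrary of $\alpha$.

Definition 2 [1] Given an assumption-based framework $\langle T, Ab, \overline{\phantom{\alpha}} \rangle$ and $\Delta \subseteq Ab$,
- $\Delta$ is conflict-free iff for all $\alpha \in Ab$, $T \cup \Delta \nvdash \alpha, \overline{\alpha}$ (that is, $T \cup \Delta$ does not derive both $\alpha$ and its contrary),
- $\Delta$ is maximal conflict-free iff $\Delta$ is conflict-free and there is no conflict-free $\Delta' \supset \Delta$.

Definition 3 [1] Given an assumption-based framework $\langle T, Ab, \overline{\phantom{\alpha}} \rangle$,
- a set of assumptions $\Delta \subseteq Ab$ attacks an assumption $\alpha \in Ab$ iff $T \cup \Delta \vdash \overline{\alpha}$,
- a set of assumptions $\Delta \subseteq Ab$ attacks a set of assumptions $\Delta' \subseteq Ab$ iff $\Delta$ attacks some assumption $\alpha \in \Delta'$.

If $\Delta$ attacks $\alpha$ (respectively $\Delta'$), we also say that $\Delta$ is an attacker against $\alpha$ (resp. $\Delta'$), or an $\alpha$-attacker (resp. $\Delta'$-attacker). Notice the deviation from the notation of the original in [1].

Definition 4 [1] Given an assumption-based framework $\langle T, Ab, \overline{\phantom{\alpha}} \rangle$, and a set of assumptions $\Delta \subseteq Ab$: $\Delta$ is closed iff $\Delta = \{\alpha \in Ab \mid T \cup \Delta \vdash \alpha\}$.

Definition 5 [1] A set of assumptions $\Delta$ is stable iff
- $\Delta$ is closed,
- $\Delta$ does not attack itself, and
- $\Delta$ attacks each assumption $\alpha \notin \Delta$.

Definition 6 [1] 1. An assumption $\alpha \in Ab$ is said to be defended with respect to a set $\Delta \subseteq Ab$ iff for each closed set of assumptions $\Delta' \subseteq Ab$, if $\Delta'$ is a minimal $\alpha$-attacker then $\Delta$ attacks $\Delta'$.

2. A closed set of assumptions $\Delta \subseteq Ab$ is admissible iff

- $\Delta$ does not attack itself, and
- each $\alpha$ in $\Delta$ is defended with respect to $\Delta$.

Definition 7 [1] A set of assumptions $\Delta \subseteq Ab$ is preferred iff $\Delta$ is maximal (with respect to set inclusion) admissible.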
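To make Definitions 3 to 5 concrete, the following sketch computes $Th(T \cup \Delta)$ by forward chaining and enumerates the stable sets of a small finite framework. It is an illustration under stated assumptions rather than an implementation from [1]: rules are encoded as `(premises, conclusion)` pairs, the contrary mapping as a dictionary, and the two-assumption framework at the end is a hypothetical example.

```python
from itertools import combinations


def closure(rules, sentences):
    """Th(.): all sentences derivable by forward chaining over (premises, conclusion) rules."""
    derived = set(sentences)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and set(premises) <= derived:
                derived.add(conclusion)
                changed = True
    return derived


def attacks(rules, theory, contrary, delta, alpha):
    """Definition 3: delta attacks alpha iff T together with delta derives the contrary of alpha."""
    return contrary[alpha] in closure(rules, theory | delta)


def is_stable(rules, theory, assumptions, contrary, delta):
    """Definition 5: closed, does not attack itself, attacks every assumption outside."""
    th = closure(rules, theory | delta)
    closed = delta == assumptions & th
    attacks_itself = any(contrary[a] in th for a in delta)
    attacks_rest = all(contrary[a] in th for a in assumptions - delta)
    return closed and not attacks_itself and attacks_rest


def stable_sets(rules, theory, assumptions, contrary):
    """Brute-force enumeration over all subsets of the assumptions."""
    items = sorted(assumptions)
    candidates = (frozenset(c) for r in range(len(items) + 1)
                  for c in combinations(items, r))
    return [d for d in candidates
            if is_stable(rules, theory, assumptions, contrary, d)]


# Hypothetical framework: assuming "not_q" yields p, assuming "not_p" yields q.
rules = [(("not_q",), "p"), (("not_p",), "q")]
theory = frozenset()
assumptions = frozenset({"not_p", "not_q"})
contrary = {"not_p": "p", "not_q": "q"}
stable = stable_sets(rules, theory, assumptions, contrary)
```

In this hypothetical framework each singleton attacks the other assumption, so `stable` contains the two singleton sets, while the empty set and the full set are rejected (the first attacks nothing, the second attacks itself). The enumeration is exponential in the number of assumptions and is meant only for small examples.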

3 Pollock’s System 

In Pollock's system, one represents an argument as a graph in which the nodes represent steps of inference in the argument. The directed edges in the graph are of three types. The first, conclusive reasons, represent deductive links or logical entailment from one node to another. The second, prima facie reasons, represent defeasible reasons to conclude one node given another. The third, defeaters, represent reasons not to accept a node that was only defeasibly concluded. In this paper, we limit our discussion to graphs that have no conclusive reasons, since the intended resolution of a situation in which a node is both conclusively deduced and also defeated is not clear in Pollock's work. We follow Pollock's conventions, representing prima facie or defeasible reasons with dashed arrows, and defeaters with filled arrows.

As presented in [6], at most one justification is allowed for each node. That is, all of the prima facie reasons for a node in a graph are required, under conjunction, to justify a conclusion together. To represent multiple justifications^1 we introduce the concept of labelled inference links so that different justifications of a node can be identified with different labels, not multiple graphs. For example, in figure 1, node D is justified either through justification 1 (requiring support from both node A and node B) or through justification 2, requiring support only from node C. Node F has only one justification, namely 3. Nodes D and F defeat each other.

^1 which Pollock does by variously appealing to the use of multiple graphs, or modifying the graphs to be AND/OR graphs

Fig. 1. A simple inference graph

Definition 8 An inference graph is a labelled directed graph $\langle V, L, \mathcal{I}, \mathcal{D} \rangle$ consisting of a finite set of nodes $V$ representing the steps of inference (or arguments), and a set of labels $L$ which gives each justification of a node a unique name. Furthermore,

- $\mathcal{I} \subseteq V \times V \times L$ is the set of labelled inference links (also called reasons); and

- $\mathcal{D} \subseteq V \times V$ is the set of defeat links (also called defeaters).

Let $\alpha$ be a node of an inference graph $\langle V, L, \mathcal{I}, \mathcal{D} \rangle$; we will denote

    • the set of labels that denote the justifications of $\alpha$ as $\alpha$-bases,

    • the set of nodes that are connected to $\alpha$ through a defeat link as $\alpha$-defeaters.

    • each $\lambda \in \alpha$-bases is associated with a set of nodes which are required to fulfill the justification labelled $\lambda$. We denote this set by $\lambda$-root.

- In a graph $\Gamma$ a node $\alpha$ is initial iff $\alpha$-defeaters $= \emptyset$ and for every $\beta$ in every $\lambda$-root of $\alpha$, $\beta$ is initial. We call the set of initial nodes in the graph $\Gamma$-initial.

For instance, in the example in Figure 1, we have: 

- D-bases = {1, 2},

- D-defeaters = {F}, 

- The 1-root of D is {A, B}, and the 2-root of D is {C}. 
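The notions of Definition 8 map naturally onto a small data structure. The sketch below is illustrative only: the class and method names are not from the paper, inference links are read as `(source, target, label)` and defeat links as `(defeater, defeated)`, and the link `("E", "F", 3)` for Figure 1 is an assumption inferred from the fact that E is listed among the initial nodes in the translation of that graph. The set of initial nodes is computed as a least fixed point, which coincides with the recursive definition when the inference links contain no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceGraph:
    nodes: frozenset
    labels: frozenset
    links: frozenset    # triples (source, target, label)
    defeats: frozenset  # pairs (defeater, defeated)

    def bases(self, node):
        """Labels naming the justifications of `node`."""
        return {lab for (_, tgt, lab) in self.links if tgt == node}

    def root(self, node, label):
        """Nodes required to fulfil the justification `label` of `node`."""
        return {src for (src, tgt, lab) in self.links
                if tgt == node and lab == label}

    def defeaters(self, node):
        """Nodes connected to `node` through a defeat link."""
        return {d for (d, tgt) in self.defeats if tgt == node}

    def initial(self):
        """Undefeated nodes all of whose supporting nodes are initial (least fixed point)."""
        found = set()
        changed = True
        while changed:
            changed = False
            for n in self.nodes - found:
                supports = {s for (s, t, _) in self.links if t == n}
                if not self.defeaters(n) and supports <= found:
                    found.add(n)
                    changed = True
        return found


figure1 = InferenceGraph(
    nodes=frozenset("ABCDEF"),
    labels=frozenset({1, 2, 3}),
    links=frozenset({("A", "D", 1), ("B", "D", 1), ("C", "D", 2), ("E", "F", 3)}),
    defeats=frozenset({("D", "F"), ("F", "D")}),
)
```

With this encoding, `figure1.bases("D")` gives the labels 1 and 2, `figure1.defeaters("D")` gives F, and the 1-root and 2-root of D are {A, B} and {C} respectively, as listed above.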

Definition 9 A status assignment $\sigma$ of an inference graph $\Gamma = \langle V, L, \mathcal{I}, \mathcal{D} \rangle$ is a function $V \to \{defeated, undefeated\}$ which indicates the status of nodes in $V$. A status assignment must also abide by the restrictions which amend those in [6] to admit multiple justifications.

An assignment $\sigma$ of defeated and undefeated to a subset of the nodes of an inference graph is a partial status assignment iff for all $\alpha \in V$,

(1) $\sigma(\alpha) = undefeated$ if $\alpha \in \Gamma$-initial

(2) $\sigma(\alpha) = undefeated$ iff $\sigma(\beta) = undefeated$ for all $\beta$ in some $\lambda$-root of $\alpha$, and $\sigma(\gamma) = defeated$ for all $\gamma \in \alpha$-defeaters

(3) $\sigma(\alpha) = defeated$ iff $\sigma(\beta) = defeated$ for some $\beta$ in every $\lambda$-root of $\alpha$, or $\sigma(\gamma) = undefeated$ for some $\gamma \in \alpha$-defeaters

$\sigma$ is a status assignment iff $\sigma$ is a partial status assignment and $\sigma$ is not properly contained in any other status assignment.

We use Status Assignment (i.e. with capital letters) to refer to the intersection of all status assignments of a single graph $\Gamma$.

There are two possible interpretations of this definition. The first corresponds to $\sigma$ being a total function over some sub-set of $V$, with all nodes not in the domain of $\sigma$ being removed from the graph. The second corresponds to $\sigma$ being a partial function over the entire graph $\Gamma$. It is Pollock's intention that the first of these interpretations be used [personal correspondence].
In situations where there is more than one status assignment, one takes the
skeptical approach, giving the following definition:

Definition 10. [6] [7] A node is considered undefeated iff every status assignment 
assigns undefeated to it; otherwise it is defeated. Of the defeated nodes, a node 
is defeated outright iff no status assignment assigns undefeated to it, otherwise 
it is provisionally defeated. 

We introduce a fourth definition, for clarity.

Definition 11. A total status assignment $\sigma$ of a graph $\Gamma = \langle V, L, \mathcal{I}, \mathcal{D} \rangle$ is a status assignment that is total over $V$.
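For small graphs, total status assignments can be found by brute force, and the skeptical classification of Definition 10 can then be read off. The sketch below is restricted to total assignments and therefore does not cover graphs that admit only partial ones; the function names, the tuple encoding of links, and the link `("E", "F", 3)` for Figure 1 are assumptions made for illustration.

```python
from itertools import product

UNDEFEATED, DEFEATED = "undefeated", "defeated"


def satisfies_rules(node, sigma, links, defeats):
    """Check conditions (2) and (3) of Definition 9 for a non-initial node."""
    labels = {lab for (_, t, lab) in links if t == node}
    supported = any(
        all(sigma[s] == UNDEFEATED
            for (s, t, l) in links if t == node and l == lab)
        for lab in labels)
    unattacked = all(sigma[d] == DEFEATED for (d, t) in defeats if t == node)
    expected = UNDEFEATED if supported and unattacked else DEFEATED
    return sigma[node] == expected


def total_status_assignments(nodes, links, defeats, initial):
    """Enumerate every total assignment that respects conditions (1) to (3)."""
    free = sorted(n for n in nodes if n not in initial)
    found = []
    for values in product((UNDEFEATED, DEFEATED), repeat=len(free)):
        sigma = {n: UNDEFEATED for n in initial}   # condition (1)
        sigma.update(zip(free, values))
        if all(satisfies_rules(n, sigma, links, defeats) for n in free):
            found.append(sigma)
    return found


def classify(nodes, assignments):
    """Definition 10: undefeated, defeated outright, or provisionally defeated."""
    if not assignments:
        raise ValueError("no total status assignment; partial ones are needed")
    verdict = {}
    for n in nodes:
        votes = {sigma[n] for sigma in assignments}
        if votes == {UNDEFEATED}:
            verdict[n] = "undefeated"
        elif votes == {DEFEATED}:
            verdict[n] = "defeated outright"
        else:
            verdict[n] = "provisionally defeated"
    return verdict


nodes = set("ABCDEF")
links = {("A", "D", 1), ("B", "D", 1), ("C", "D", 2), ("E", "F", 3)}
defeats = {("D", "F"), ("F", "D")}
assignments = total_status_assignments(nodes, links, defeats, initial={"A", "B", "C", "E"})
verdict = classify(nodes, assignments)
```

For the graph of Figure 1 there are two total status assignments, one accepting D and one accepting F, so both D and F come out as provisionally defeated while A, B, C and E are undefeated.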

For the intended interpretation of status assignment to give the proper results, it must be the case that a partial status assignment can only be admitted if it is maximal in two senses: (i) that it not be properly contained in any other status assignment, as per Definition 9, and (ii) it also must come from a sub-graph which is maximal in the following sense. A sub-graph of $\Gamma$, say $\Gamma'$, is maximal with respect to status assignments on $\Gamma$ iff $\Gamma'$ has a total status assignment and no graph $\Gamma'' \supset \Gamma'$, where $\Gamma'' \subseteq \Gamma$, has a total status assignment.

Without this addition, the system as originally presented by Pollock must consider any node with incident defeaters as provisionally defeated^2, and hence only initial nodes could ever be considered undefeated.



3.1 Provisional and Outright Defeat 

In Definition 10, Pollock draws the distinction between the provisional and out- 
right defeat of a node. Although they do not figure directly in any of his more 
formal definitions, they form the basis of the motivation for these definitions [6] 
and [7]. From [6], a provisionally defeated node should be infectious; that is, 
unable to support any inference, but able to defeat another node. An outright 
defeated node should not be able to do either. Although the formal definitions 
capture Pollock’s intentions in graphs that have total status assignments, they 
fail to do so in graphs that have only partial status assignments. That is, the 
motivations behind the definitions are inconsistent with the definitions them- 
selves. 

In particular, cases of so-called self-defeat, where a node defeats one of its (possibly indirect) ancestors, were one of the primary motivators for the notion of outright defeat [6]. However, a self-defeating node results in only the provisional defeat of that node. The admission of partial status assignments recognises that there can be arguments for a node which are inconsistent. However, the manner in which they are handled essentially involves hoping that the Status Assignment generated by taking the intersection of all status assignments captures the intended meaning of the original graph. The example of a self-defeating argument for a node, which should result in that node's outright defeat, shows that this is not always the case. A similar argument can be made regarding defeat cycles of odd length.

^2 This follows immediately from allowing $\sigma$ to be defined over arbitrary subsets of $V$. For instance, consider the status assignment for the graph in Figure 1 containing only four nodes A, B, C and E. If this were admitted, then E would be considered provisionally defeated in the Status Assignment of $\Gamma$.



4 Relationships between Pollock's and Bondarenko et al.'s Frameworks

Despite the similar appearance of the two frameworks proposed by Pollock and 
Dung (in the most straightforward way, one would think about an equivalence 
between the attacks relationship in [4] and the defeat links in [6]), there are no 
obvious ways to subsume one under another or to prove their equivalence. This 
is due to differences underlying the basic structures of the two frameworks. 

From the above re-formalisation of Pollock's system in terms of algebraic expressions, however, one can see resemblances with Bondarenko et al.'s system. Some minor differences remain: (i) the structure of the arguments is not restricted in Pollock's system, while Bondarenko et al. require the elements of an assumption-based framework to be logical sentences and the links among them to be inference rules; (ii) in Pollock's system the defeating relationships are explicitly represented, whereas in Bondarenko et al.'s they must be derived through the derivation of the contrary of assumptions.

Despite the above differences, there seem to be many commonalities between the two frameworks. Firstly, the central notions of the two systems model both the reasoning process as arguments and the defeat (or attack) relationships between the basic elements of each system. Secondly, the intuitions behind the semantics (resp. status assignments) of assumption-based frameworks (resp. inference graphs) seem to be closely related. For instance, the stable model semantics [1] requires a stable extension to attack every assumption which does not belong to it. There is an analogous requirement for status assignments: that the accepted nodes be assigned undefeated, while the rest are assigned defeated. Further, if a status assignment assigns defeated to a node, it must assign defeated to one of its ancestors, or undefeated to one of its defeaters: this is similar to being attacked. A similar correspondence between preferred extensions and partial status assignments can also be observed.^3 In light of these similarities, one may ask: are these two systems actually equivalent in some sense, especially from a semantic point of view, i.e., will they produce the same answer for every problem domain?

In general these two systems do not produce exactly the same results as each other, in particular when there are no stable extensions or total status assignments for the given problem. This should also be expected because of the different motivations behind each system: the credulous semantics play an important part in [1], [4], whereas Pollock's system maintains a sceptical approach. Notice also that the sceptical versions of the semantics proposed in [1] do not correspond exactly to the assignment of defeated and undefeated to the nodes of inference graphs. However, since both approaches have their own suitable applications, a further study to combine the merits of both of them into a single system is desirable. That is the main aim of this paper.

^3 Part of this result is reported in this paper (Theorem 2 and Corollary 2). The other part appears in an extended version of this paper.

The first observation is that a common language is necessary to compare the results of both systems. On the surface, Pollock's system seems more general than Bondarenko et al.'s system since no restrictions are imposed on the internal structures of the arguments or the nodes in an inference graph. However, we will embed Pollock's system into the formalism proposed in [1]. There are two reasons that we wish to do so: (i) it has been argued elsewhere [10] that Pollock has taken his research in a direction which is too general for AI's uses; and (ii) Bondarenko et al.'s system has several well-defined (and well understood) semantics and has close connections with most major frameworks in non-monotonic reasoning.



4.1 The Translation

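To define the transformation of an inference graph to an assumption-based framework, we introduce the function $\mathcal{T}$ defined as follows: let $S$ be a set of symbols, $\mathcal{T}(S) = \{\mathcal{T}(\theta) \mid \theta \in S\}$. Intuitively, elements of the set $\mathcal{T}(S)$ play a similar role to the justifications in default rules as defined in [9]. $\mathcal{T}(\theta)$ is read as “it is consistent to assume that $\theta$ does not hold”.

Given an inference graph $\Gamma = \langle V, L, \mathcal{I}, \mathcal{D} \rangle$, for convenience, we define $\Gamma$-defeaters $= \{\delta \in V \mid \text{for some } \alpha \in V,\ (\delta, \alpha) \in \mathcal{D}\}$.

Definition 12 Let $\Gamma = \langle V, L, \mathcal{I}, \mathcal{D} \rangle$ be an inference graph; the translation of $\Gamma$ to an assumption-based framework, denoted as $ABF(\Gamma)$, is defined as follows:
$ABF(\Gamma) = ((\mathcal{L}, \mathcal{R}), \langle T, Ab, \overline{\phantom{\alpha}} \rangle)$, where

- $\mathcal{L} = V \cup \mathcal{T}(\Gamma\text{-defeaters})$,

- $\mathcal{R}$ is the smallest set such that for each node $\alpha$ in $V$: let $\alpha$-defeaters $= \{\delta_1, \ldots, \delta_m\}$; if $\lambda \in \alpha$-bases and $\lambda$-root $= \{\alpha_1, \ldots, \alpha_n\}$ then

$$\frac{\alpha_1, \ldots, \alpha_n, \mathcal{T}(\delta_1), \ldots, \mathcal{T}(\delta_m)}{\alpha} \in \mathcal{R},$$

- $T = \Gamma$-initials,

- $Ab = \mathcal{T}(\Gamma\text{-defeaters})$, and

- for each $\alpha \in Ab$, if $\alpha = \mathcal{T}(\theta)$ then $\overline{\alpha} = \theta$.

Fig. 2. A simple inference graph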

Example 1 Consider the inference graph in figure 1:

$\Gamma_1 = \langle \{A, B, C, D, E, F\}, \{1, 2, 3\}, \{(A, D, 1), (B, D, 1), (C, D, 2), (E, F, 3)\}, \{(D, F), (F, D)\} \rangle$, which will be translated to an assumption-based framework as follows:

$ABF(\Gamma_1) = ((\{A, B, C, D, E, F\}, \mathcal{R}_1), \langle \{A, B, C, E\}, \{\mathcal{T}(D), \mathcal{T}(F)\}, \overline{\phantom{\alpha}} \rangle)$, where $\overline{\mathcal{T}(D)} = D$ and $\overline{\mathcal{T}(F)} = F$.
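The translation of Definition 12 is mechanical, so it can be sketched as a short function. The encoding below is an assumption made for illustration: $\mathcal{T}(\theta)$ is represented by the tuple `("T", theta)`, inference links are read as `(source, target, label)`, rules are `(premises, conclusion)` pairs, and the link `("E", "F", 3)` is inferred from E being listed among the initial nodes in Example 1.

```python
def T(theta):
    """Assumption 'it is consistent to assume that theta does not hold'."""
    return ("T", theta)


def initial_nodes(nodes, links, defeats):
    """Least set of undefeated nodes whose supporting nodes are all initial."""
    defeated = {t for (_, t) in defeats}
    found = set()
    changed = True
    while changed:
        changed = False
        for n in set(nodes) - found:
            supports = {s for (s, t, _) in links if t == n}
            if n not in defeated and supports <= found:
                found.add(n)
                changed = True
    return found


def translate(nodes, links, defeats):
    """Return (language, rules, theory, assumptions, contrary) as in Definition 12."""
    all_defeaters = {d for (d, _) in defeats}
    assumptions = {T(d) for d in all_defeaters}
    language = set(nodes) | assumptions
    rules = set()
    for alpha in nodes:
        deltas = sorted(d for (d, t) in defeats if t == alpha)
        for lab in {l for (_, t, l) in links if t == alpha}:
            root = sorted(s for (s, t, l) in links if t == alpha and l == lab)
            premises = tuple(root) + tuple(T(d) for d in deltas)
            rules.add((premises, alpha))
    theory = initial_nodes(nodes, links, defeats)
    contrary = {T(d): d for d in all_defeaters}
    return language, rules, theory, assumptions, contrary


nodes = set("ABCDEF")
links = {("A", "D", 1), ("B", "D", 1), ("C", "D", 2), ("E", "F", 3)}
defeats = {("D", "F"), ("F", "D")}
language, rules, theory, assumptions, contrary = translate(nodes, links, defeats)
```

For Figure 1 this yields the theory {A, B, C, E} and the assumptions $\mathcal{T}(D)$ and $\mathcal{T}(F)$ of Example 1, together with three rules: D from A, B and $\mathcal{T}(F)$; D from C and $\mathcal{T}(F)$; and F from E and $\mathcal{T}(D)$.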

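Example 2 Consider the inference graph in figure 2:

$\Gamma_2 = \langle \{A, B, C, P, Q, R\}, \{1, 2, 3\}, \{(A, P, 1), (P, Q, 2), (C, R, 3)\}, \{(P, Q), (Q, R), (R, P)\} \rangle$, which will be translated to an assumption-based framework as follows:

$ABF(\Gamma_2) = ((\{A, B, C, P, Q, R\}, \mathcal{R}_2), \langle \{A, B, C\}, \{\mathcal{T}(P), \mathcal{T}(Q), \mathcal{T}(R)\}, \overline{\phantom{\alpha}} \rangle)$, where $\overline{\mathcal{T}(\mu)} = \mu$ for each $\mu \in \{P, Q, R\}$.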

There are several semantics proposed for assumption-based frameworks as has been mentioned above. Also, Pollock introduces a semi-formal system for inference graphs in which the notions of status assignments and partial status assignments have been employed to give the acceptability (or alternatively defeat) status for each node in the graph. In addition, an algorithm has also been provided.^4 The interesting question is: do the notions of status assignments and partial status assignments correspond to any of the formal semantics introduced by Bondarenko et al.?

Before we answer, we make two notes. Firstly, in Pollock's work, the notions of “defeated” and “undefeated” have been used throughout while Bondarenko et al. use the notion of “acceptability” in most cases. There is an obvious correspondence between the two notions of “undefeated” and “acceptability”. We will use “undefeated” in contexts which are more or less involved with inference graphs while “acceptability” will be used more often when we talk about assumption-based frameworks. Secondly, we will only consider stratified inference graphs in the sense that: (i) nodes are finitely reachable from the initial nodes; and (ii) no cycles that involve only inference links are allowed^5. We give the formal definition for stratified inference graphs:

Definition 13 Let $\Gamma = \langle V, L, \mathcal{I}, \mathcal{D} \rangle$ be an inference graph; $\Gamma$ is said to be stratified if and only if it contains no infinite path of the form $\alpha_1, \ldots, \alpha_n, \ldots$, where for each $i \geq 1$, there exists a $\lambda \in L$ such that $(\alpha_i, \alpha_{i+1}, \lambda) \in \mathcal{I}$.



* Whether or not the algorithm does correspond exactly to the definitions and the 
definitions correctly reflect the intuitions motivating Pollock’s framework is another 
story. Certainly, it is unclear whether the algorithm presented in [6] captures the 
definitions presented therein. 

® In this case, there are at most two ways to assign statuses to the nodes involved in 
the cycle regardless of the assignment given for other nodes, naimely aU of them axe 
assigned either defeated or undefeated. It boils down to the following three cases: 
two trivial cases in which (a) adl nodes must have the defeated status; and (6) all 
nodes must have the undefeated status. The only non-trivial case arises when both 
status Eissignments Me consistent with the given graph. In this case none of these 
nodes will be accepted in a scepticad extension of the given inference graph which 
takes the intersection of aill assignments. Though it is not impossible to obtaiin the 
results below with the presence of cycles of inference links (the obvious way is to 
expand the set of assumptions to include all elements of the set V as well), this 
restriction will significaintly simplify the presentation eind concentrate on the essence 
of the results rather them getting bogged down with unnecessary details. 
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Notice that our notion of stratification of inference graphs as defined in the 
above definition has no correspondence with the notion of stratified assumption- 
based framework as defined by Bondarenko et al. The key difference is that we 
try to avoid the cycles of inference links for the sake of simplicity whereas strati- 
fied assumption-based framework is defined on the basis of so-called attack rela- 
tionship graphs. Say in other words, by stratified assumption-based frameworks, 
Bondarenko et al. rule out infinite chains of attack links® (including cyclic chains 
of attack links) in assumption-based frameworks. It is obvious that in most ap- 
plications, the arguments, and thus the assumptions, would usually attack each 
other. Therefore, Bondarenko et aUs stratified assumption-based frameworks are 
not very useful for general real-world applications. 

Henceforth, when there is no possible confusion, we will refer to stratified 
inference graphs simply as inference graphs. 
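
As an illustration of Definition 13, the following is a minimal sketch of a stratification test, assuming the graph is finite. In a finite graph an infinite path of inference links exists exactly when the links contain a cycle, so the test reduces to cycle detection. The function name `is_stratified` and the triple layout `(source, target, label)` are choices made for this sketch, and condition (i) on reachability from the initial nodes is not checked here.

```python
from collections import defaultdict

def is_stratified(inference_links):
    """Return True if the finite graph induced by the inference links
    (source, target, label) has no cycle, i.e. no infinite path."""
    successors = defaultdict(set)
    nodes = set()
    for source, target, _label in inference_links:
        successors[source].add(target)
        nodes.update((source, target))

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, done
    colour = {node: WHITE for node in nodes}

    def has_cycle_from(node):
        colour[node] = GREY
        for nxt in successors[node]:
            if colour[nxt] == GREY:       # back edge: a cycle of inference links
                return True
            if colour[nxt] == WHITE and has_cycle_from(nxt):
                return True
        colour[node] = BLACK
        return False

    return not any(colour[n] == WHITE and has_cycle_from(n) for n in nodes)

# Inference links of Example 1 as listed in the text.
GAMMA_1_LINKS = [("A", "C", 1), ("B", "C", 1), ("D", "C", 2), ("B", "F", 3)]
```

Under this reading, `is_stratified(GAMMA_1_LINKS)` holds, while a pair of links such as `("A", "B", 1)` and `("B", "A", 2)` is rejected.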

4.2 The Correspondence 

We show that the above translation captures exactly the intuitive meaning of the 
defeat links within the attacks relationships. The following theorem shows that, 
under the above translation, the stable model semantics corresponds exactly to 
the status assignments as defined by Pollock. 

Lemma 1 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph and $\Delta$ a set of
assumptions of $ABF(\Gamma)$.

$T \cup \Delta \vdash \tau(\alpha)$ if and only if $\tau(\alpha) \in \Delta$.

The proof for the above lemma is trivial according to the fact that the
consequents of the inference rules in $\mathcal{R}$ can only be members of the set V. This
lemma implies that every set of assumptions $\Delta \subseteq Ab$ of $ABF(\Gamma)$ is closed.
Those assumption-based frameworks that enjoy this property are said to be flat
following Bondarenko et al. (1997).

We now proceed to the main lemma of this paper. 

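Lemma 2 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph and $\sigma$ a status
assignment for $\Gamma$. Let $\Delta_\sigma = \{\tau(\delta) \in Ab \mid \sigma(\delta) = \text{defeated}\}$ be a set of assumptions
of $ABF(\Gamma)$. Then $\alpha \in Th(T \cup \Delta_\sigma)$ iff $\sigma(\alpha) = \text{undefeated}$, where $T$ is the theory of
$ABF(\Gamma)$.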

Then part (a) of the following theorem is an immediate corollary of the
above lemma. For the converse, we introduce the notion of a status assignment
associated with a set of assumptions with respect to the translated
assumption-based framework of a given inference graph. Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference
graph and $\Delta$ a set of assumptions in $ABF(\Gamma)$. The mapping $\sigma_\Delta$ which assigns
the “undefeated” and “defeated” status to elements of V, defined as follows: for
each $\alpha \in V$,

$$\sigma_\Delta(\alpha) = \begin{cases} \text{undefeated}, & \text{if } \alpha \in Th(T \cup \Delta) \\ \text{defeated}, & \text{otherwise,} \end{cases}$$

is called the $\Delta$-assumption assignment.

‡ Bondarenko et al.’s attack links are similar to defeat links in Pollock’s terms under
our translation. The difference is that a defeat link connects two nodes
of an inference graph whereas the attack links connect assumptions. The
essential details make no difference since both kinds of links play the same role.

Theorem 1 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph.

(a) If $\sigma$ is a status assignment for $\Gamma$ then the set of assumptions $\Delta_\sigma$ is stable
in $ABF(\Gamma)$.

(b) If $\Delta$ is a stable set of assumptions in $ABF(\Gamma)$ then $\sigma_\Delta$ is a status
assignment for $\Gamma$.

Example 1 (continued.) In Pollock’s framework, there are two total status
assignments for $\Gamma_1$, denoted as $\sigma_1$ and $\sigma_2$, which are defined as follows:

$\sigma_1(A) = \sigma_1(B) = \sigma_1(C) = \sigma_1(D) = \sigma_1(E) = \text{undefeated}$ and $\sigma_1(F) = \text{defeated}$, and

$\sigma_2(A) = \sigma_2(B) = \sigma_2(C) = \sigma_2(E) = \sigma_2(F) = \text{undefeated}$ and $\sigma_2(D) = \text{defeated}$.

On the other hand, we have the following two stable sets of assumptions in
$ABF(\Gamma_1)$: $\Delta_1 = \{\tau(F)\}$ and $\Delta_2 = \{\tau(D)\}$, which respectively correspond to the
following two extensions of $ABF(\Gamma_1)$: $\{A, B, C, E, D, \tau(F)\}$ and $\{A, B, C, E, F, \tau(D)\}$.

One thus can observe the correspondence between the notions of status as- 
signments and stable extensions in this example. 

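Corollary 1 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph.

(a) If $\sigma$ is a (total) status assignment for $\Gamma$ then $\sigma_{\Delta_\sigma} = \sigma$.

(b) If $\Delta$ is a stable set of assumptions in $ABF(\Gamma)$ then $\Delta_{\sigma_\Delta} = \Delta$.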

Notice also that there is no (total) status assignment for the inference
graph $\Gamma_2$. This is the case for self-defeating arguments and odd-length cycles
of defeat links (which is the case for $\Gamma_2$) as has been pointed out by Pollock [6].
On the other hand, the assumption-based framework $ABF(\Gamma_2)$ does not possess
any stable set of assumptions since any set of assumptions that contains at least
two of the assumptions $\tau(P)$, $\tau(Q)$, and $\tau(R)$ must attack itself and any set
of assumptions that contains at most one of the assumptions does not attack
some other assumption, e.g. the set $\Delta = \{\tau(P)\}$ does not attack $\tau(R)$.
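
The argument above can be checked by brute force. The sketch below enumerates every subset of the three assumptions and tests stability. It rests on an assumed reading of the translation, not on a definition quoted from the paper: P, Q and R are each inferred from one of the initial nodes A, B and C; a defeat link (X, Y) is read as "X defeats Y"; and each rule needs the assumption that the defeater of its conclusion is defeated. The names `tP`, `tQ`, `tR` (for the three assumptions) and all function names are hypothetical. Closedness is not tested because the framework is flat.

```python
from itertools import combinations

# Assumed encoding of the translated framework for the graph of Example 2.
FACTS = {"A", "B", "C"}                       # initial nodes
ASSUMPTIONS = {"tP", "tQ", "tR"}              # tX stands for tau(X)
CONTRARY = {"tP": "P", "tQ": "Q", "tR": "R"}
RULES = [
    ({"A", "tR"}, "P"),   # P follows from A provided its defeater R is defeated
    ({"B", "tP"}, "Q"),   # Q follows from B provided its defeater P is defeated
    ({"C", "tQ"}, "R"),   # R follows from C provided its defeater Q is defeated
]

def closure(delta):
    """Forward-chain the rules from the facts and the assumed set delta."""
    derived = set(FACTS) | set(delta)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def attacks(delta, assumption):
    """delta attacks an assumption if it derives that assumption's contrary."""
    return CONTRARY[assumption] in closure(delta)

def is_stable(delta):
    """Stable: attacks none of its members and attacks every non-member."""
    no_self_attack = not any(attacks(delta, a) for a in delta)
    attacks_rest = all(attacks(delta, a) for a in ASSUMPTIONS - set(delta))
    return no_self_attack and attacks_rest

def stable_sets():
    ordered = sorted(ASSUMPTIONS)
    subsets = (set(c) for k in range(len(ordered) + 1)
               for c in combinations(ordered, k))
    return [s for s in subsets if is_stable(s)]
```

Under these assumptions `stable_sets()` returns an empty list, in line with the observation that the odd cycle of defeat links leaves no stable set.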

It is not surprising that (total) status assignments in Pollock’s system exactly
correspond to the stable model semantics of Bondarenko et al.’s framework, and
therefore also fit into the standard semantics of most non-monotonic reasoning
formalisms including default logic, non-monotonic modal logic, autoepistemic
logic, logic programming and some fragments of circumscription as has been
shown by Bondarenko et al. But it is well-known that the more important
features in both frameworks proposed by Pollock and Bondarenko et al. lie in the
extensions of their frameworks: the partial status assignments, and the
preferred and admissible semantics respectively. These extensions allow their
systems to cope with some interesting problems which cannot be solved by many
traditional non-monotonic formalisms. We now investigate the relationships
between these notions more closely.

Theorem 2 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph. If $\sigma$ is a partial status
assignment for $\Gamma$ then the set of assumptions $\Delta_\sigma$ is admissible in $ABF(\Gamma)$.

Corollary 2 Let $\Gamma = (V, L, \mathcal{I}, \mathcal{D})$ be an inference graph. If $\sigma$ is a maximal
partial status assignment for $\Gamma$ then the set of assumptions $\Delta_\sigma$ is preferred in
$ABF(\Gamma)$.

Example 2 (continued.) In Pollock’s framework, there is a single partial
status assignment for $\Gamma_2$ (which is also maximal), denoted as $\sigma^2$, defined as
follows:

$\sigma^2(A) = \sigma^2(B) = \sigma^2(C) = \text{undefeated}$ and $\sigma^2(P) = \sigma^2(Q) = \sigma^2(R) = \text{unassigned}$.

This allows us to compute the following preferred set of assumptions in
$ABF(\Gamma_2)$ (following the definition given above): $\Delta^2 = \Delta_{\sigma^2} = \emptyset$. This set of
assumptions is associated with the preferred extension consisting of only A, B
and C.

It is obvious that the converse of Theorem 2 does not hold. That the converse 
of Corollary 2 holds is reported in the extended version of this paper. 

5 Discussion 

As has been discussed throughout this paper, the argumentation systems as pro- 
posed by Pollock and Bondarenko et al. have some important strengths in com- 
parison to other frameworks in defeasible and non-monotonic reasoning. Among 
the significant advantages of these approaches is their generality. Both are ab- 
stract argumentation frameworks which in general do not depend on a particular 
underlying logic or language. This feature allows both systems to capture most 
reasoning facilities sanctioned by well-known non-monotonic reasoning mecha- 
nisms. In particular, as it has been proved in [1], most non-monotonic reasoning 
mechanisms can be expressed in terms of assumption-based frameworks. 

While Pollock’s system is presented with some seemingly natural representa- 
tions of argument structures (inference graphs), this system suffers from its lack 
of a well-defined formal semantics. As has been shown in this paper, some notions 
which seem to be intuitive at first sight may not turn out to be captured by the 
definitions. To rectify this shortcoming of Pollock’s system and at the same time, 
provide an insight into the two systems and their relationships, we introduced 
a translation which has been proved to be sound and complete with respect to
the stable semantics in terms of assumption-based frameworks [1]. Furthermore,
we also examined the aspects of this relationship when the standard semantics 
of default logic, logic programming, autoepistemic logic, etc. does not exist. 

The subsumption of Pollock’s systems under Bondarenko et al.’s framework
suggests that the assumption-based framework is very powerful. However, it
seems that the generality and expressiveness of assumption-based frameworks
are not always desirable. Firstly, certain peculiarities may arise when arbitrary
relationships are allowed in an assumption-based framework.§ More seriously,
Bondarenko et al.’s system does not provide any way to detect and avoid such
situations. Secondly, there has always been a trade-off between expressiveness
and computation. As has been shown in [3], except for the case of skeptical
admissible semantics which comes down to trivial monotonic reasoning, computing
preferred semantics is in general harder than computation of stable model
semantics in logic programming.

§ An example of these situations is presented in the extended version of the present
paper.

5.1 Future work 

We would like to provide a set of definitions in Pollock’s framework which prop- 
erly capture the notion of outright and provisional defeat, but maintain the 
elegant locality of the rules that currently define his system, or prove that you 
cannot. We hope to extend the results in this work further and in more detail to 
shed light on the significance of argumentation-theoretic approaches in AI and, 
in particular, reasoning.

Abstract. Description logics are powerful knowledge representation systems
providing well-founded and computationally tractable classification reasoning.
However, recognition of individuals as belonging to a concept based on some
approximate match to a prototypical descriptor has been a recurring application
issue, as description logics support only strict subsumption reasoning. Expression of
concepts as a disjunction of each possible combination of sufficient prototypical
features has previously been infeasible due to computational cost. Recent
optimisations have greatly improved disjunctive reasoning in description logic systems
and this work explores whether these are sufficient to allow the heavy use of
disjunction for approximate matching. The positive results obtained support further
exploration of the representation proposed within real applications.



1 Introduction 

Description Logic systems are knowledge representation systems based on First Order
Logic or a subset of FOL chosen specifically for computational reasons. They have
a Tarski-style semantics and a syntax which is particularly well suited to an
object-oriented approach to describing concepts (classes) and individuals. Key functionality of
DL systems is based on the notion of subsumption testing which is used both in building
the class hierarchy and in recognising which concepts individuals belong to. This class
of systems has been successfully used in a range of applications (e.g. [5]). However, a
number of other applications have experienced difficulty due to the fact that recognition
of individuals as belonging to a concept is done on the basis of the individuals having
all characteristics defined for that concept. There is no notion in DLs of exceptions or
of default or typical characteristics.¹

We argue that in many applications what one wants is a notion of individuals being
recognised as members of a concept, based on having a sufficient number of a cluster of
typical characteristics (plus possibly some necessary characteristics), rather than simply
a fixed set of characteristics.

¹ Some work has been done on extending DLs to include defaults (e.g. [3, 17, 18]) but this is
primarily directed towards default reasoning, rather than the issue of using defaults for
recognition as is being explored here.

N. Foo (Ed.): AI’99, LNAI 1747, pp. 328-339, 1999.

For example Coupey and Fouquere describe how for recognising faults in a telecom- 
munications application it is absolutely necessary to be able to take account of default 
characteristics [6]. However they take an approach of requiring that individuals explic- 
itly have an exception to a default characteristic to allow recognition. This can be awk- 
ward, and is not always even possible. A medical diagnosis application [18] similarly 
needs to use typical symptoms to describe diseases, but does not want an individual 
presentation not to be recognised because it does not have all typical symptoms. 

In order to meet the needs of the many applications similar to these it is necessary to 
be able to define a set of typical characteristics associated with a concept. An individual 
belonging to the concept is required to have some “critical mass” of these character- 
istics, and should be recognised as belonging to the concept on this basis. The most 
obvious way to achieve this within the semantics of first order logic (and Description 
Logics) is to define the concept as a disjunction of all the combinations of sufficiently 
many typical characteristics. 

In an example from [18], chronic pyelonephritis is described as having the charac- 
teristics urine.dysuria, urine.casts, fatigue and urine.bacteria. Defining the concept on 
the basis of 75% of these characteristics would give us: 

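$$\begin{aligned}
\mathrm{CPN} = {} & (\text{urine.dysuria} \sqcap \text{urine.casts} \sqcap \text{fatigue}) \sqcup {} \\
& (\text{urine.casts} \sqcap \text{fatigue} \sqcap \text{urine.bacteria}) \sqcup {} \\
& (\text{fatigue} \sqcap \text{urine.bacteria} \sqcap \text{urine.dysuria}) \sqcup {} \\
& (\text{urine.bacteria} \sqcap \text{urine.dysuria} \sqcap \text{urine.casts}).
\end{aligned}$$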
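
A definition of this shape can be generated mechanically. The following sketch lists every combination of the required number of characteristics; the function names `basic_descriptor` and `render` are hypothetical, the connectives are written as plain `AND`/`OR` text, and rounding the required number down is an assumption made for this sketch.

```python
from itertools import combinations

def basic_descriptor(features, percent):
    """Return the disjuncts (tuples of features) covering every combination
    of percent% of the features, with the count rounded down."""
    required = len(features) * percent // 100
    return list(combinations(features, required))

def render(disjuncts):
    """Render the disjunction of conjunctions as plain text."""
    return " OR ".join("(" + " AND ".join(d) + ")" for d in disjuncts)

CPN_FEATURES = ["urine.dysuria", "urine.casts", "fatigue", "urine.bacteria"]
cpn = basic_descriptor(CPN_FEATURES, 75)
print(render(cpn))
```

For the four characteristics above and a 75% threshold this yields four disjuncts of three characteristics each, matching the definition of CPN up to the order of the terms.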

Previously the intractability of the algorithms used for reasoning with disjunctions 
has meant that heavy usage of disjunctions is not a viable option computationally for 
real application systems. 

For example, Kris [2], one of the first DL systems that included principled rea- 
soning with disjunction at all, exhibits very poor performance when reasoning with 
knowledge bases (KBs) containing significant numbers of disjunctive concepts. 

Optimisations of Kris that allowed it to obtain similar performance characteristics
to CLASSIC [1] (the most efficient of the set of tested DL systems [10]) did not address
optimisations for disjunctive concepts (which cannot be represented in CLASSIC).

The new algorithms and optimisation techniques recently developed allow the typ- 
ical case reasoning performance of DL systems to be radically improved [14]. These 
optimisations are particularly effective with respect to disjunctive reasoning. However 
there has been no experimentation which pushes the limits of these new algorithms, 
or examines whether they are adequate for particular application oriented needs which 
require heavy use of disjunction. 

The work presented in this paper explores whether these techniques are in fact
sufficiently powerful to support the routine use of disjunctive concepts to address the
application issue of approximate matching to a prototype for recognition of individuals.
Section 2 describes a representational model for defining concepts; section 3 describes
in some detail the problem with disjunctive reasoning and the optimisations used in the
FaCT system, which we hope will make the proposed representation viable. Section
4 describes the experiments done to investigate this viability and the results obtained.
The results appear promising and we are building a bibliographic database application
based on the techniques described, to further investigate the mechanisms within a
genuine application.



2 Representation of Concepts 



Literature from cognitive psychology supports the idea that when people think in terms 
of concepts, they actually think in terms of prototypical descriptions, rather than in 
terms of strictly necessary characteristics [19]. However using a prototypical descrip- 
tion for a concept descriptor in description logic systems (or any other system based 
on first order logic) will cause problems, as some sub-concepts as well as individuals 
will not have all characteristics of the prototype. In terms of recognising individuals, 
or automatically classifying sub-concepts, the prototypical description of the concept is 
over-defined. On the other hand, use of only necessary characteristics in defining a con- 
cept results in concepts being under-defined, with consequent lack of discrimination. 

Earlier work by Padgham and others [16, 18] has explored describing concepts
using two descriptors: a core descriptor for defining the strictly necessary characteristics
and a default descriptor (which we will call the prototype descriptor), which is
subsumed by the core and in addition defines the prototypical characteristics. However, this
mechanism does not explicitly offer any assistance in recognising the specific concept
an individual belongs to in cases where the core is under-defined and the individual
does not fit the full prototype descriptor.

We build on this work by also defining a basic descriptor which explicitly captures 
the space of concept descriptions which are sufficiently close to the prototype descriptor 
that individuals subsumed by the basic descriptor should be recognised as instances of 
the concept. 

The form of the basic descriptor is an “or” statement which defines any combination
of 70% of the “features”² used in the prototype descriptor. The basic descriptor thus
subsumes the prototype descriptor and an individual should be recognised as being an
instance of a concept X based on subsumption by the basic descriptor for X.

Once users or application developers have defined the core and prototype
descriptors the definition of basic descriptors can be automated. It would also be possible to
generate descriptors capturing varying levels of agreement with the prototype (e.g. 90%,
70%, 50%) in different structures, allowing applications to attempt instance inference,
or recognition of individuals at various levels of closeness to the prototype descriptor.

Further extensions where characteristics within a prototype can be grouped,
requiring some critical mass in each group, can also be envisaged. However, all these
refinements rely on the adequacy of the optimisations being explored to provide
computational viability when relatively large “or” clauses are routinely used.

² Further investigation is needed regarding constraints that may need to be placed on the form
of prototype descriptors. However, this is outside the scope of the initial explorations presented
in this paper. The agreement level of 70% may also be subject to variation.

3 Subsumption Involving Disjunction

Description Logic systems provide a range of automated reasoning services, in particu- 
lar inferring subsumption and instantiation (instance-of) relationships. Subsumption is 
the class/super-class relationship between concepts, while instantiation is the relation- 
ship between individuals and those concepts of which they are instances. The use of 
subsumption inference to build a concept hierarchy (partial order) is known as classi- 
fication and the use of instantiation inference to determine the classes each individual 
belongs to is known as recognition. 

A standard Tarski style model theoretic semantics is used to interpret descriptions
and to justify inferences. The meaning of concepts and roles is given by an interpretation
$\mathcal{I}$ which is a pair $(\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, where $\Delta^{\mathcal{I}}$ is the domain (a set) and $\cdot^{\mathcal{I}}$ is an interpretation
function. The interpretation function maps each concept to a subset of $\Delta^{\mathcal{I}}$, each role
to a subset of $\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$, and each individual to a unique element of $\Delta^{\mathcal{I}}$. More complex
descriptions can be built up by combining descriptions using a variety of operators, with
the semantics of the resulting description being derived from its components.

A concept C is subsumed by (is more specific than) a concept D (written $C \sqsubseteq D$)
if it can be inferred that $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$ for all possible interpretations $\mathcal{I}$. The result of
classification procedures based on the subsumption relation is typically cached in the
form of a directed acyclic graph called the concept hierarchy or taxonomy.

An individual x is an instance of a concept C (written $x \in C$) if it can be inferred
that $x^{\mathcal{I}} \in C^{\mathcal{I}}$ for all possible interpretations $\mathcal{I}$. In many cases, instantiation
reasoning (or recognition) can be reduced to subsumption reasoning using either
precompletion [11] or encoding [8] techniques; for this reason most recent studies have
concentrated on subsumption reasoning. We follow this tradition and explore the tractability
of recognition by obtaining experimental results for appropriate subsumption tests.

Most modern DL systems³ perform subsumption reasoning by transforming the
subsumption problem into an equivalent satisfiability problem: $C \sqsubseteq D$ if and only if
the concept description $(C \sqcap \neg D)$ is not satisfiable. The satisfiability problem can then
be solved using a provably sound and complete algorithm based on the tableaux
calculus [20]. This approach was first described for the $\mathcal{ALC}$ DL and its practical application
was demonstrated by the Kris system.

The FaCT system uses an optimised implementation of a tableaux algorithm to
perform subsumption reasoning. Like other tableaux algorithms it either proves the
satisfiability of a concept C by constructing an example interpretation in which $C^{\mathcal{I}}$ has
at least one member, or proves its unsatisfiability by demonstrating that all attempts to
construct an example must lead to a contradiction. When C contains disjunction, trying
to construct an example interpretation is non-deterministic. Earlier DLs dealt with this
non-determinism by naively performing an exhaustive depth first search, and it is this
which leads to the poor performance of the Kris system with highly disjunctive
concepts. Although it still performs an exhaustive search, the FaCT system includes a range
of optimisations which can dramatically reduce the size of the search space. These
include the normalisation and encoding of concept descriptions, an improved search
algorithm, the use of heuristics to guide the search, dependency directed backtracking,
and the caching and re-use of partial results.

³ At least those which provide sound and complete reasoning.
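
The reduction of subsumption to unsatisfiability described above can be illustrated on the purely propositional fragment, where concepts contain no roles and satisfiability can be decided by trying every truth assignment. This is only a sketch of the reduction, not of the tableaux algorithm used by Kris or FaCT; the function names and the representation of concepts as Python predicates are assumptions of the sketch.

```python
from itertools import product

def satisfiable(concept, atoms):
    """Brute force: is there a truth assignment under which concept holds?"""
    return any(concept(dict(zip(atoms, values)))
               for values in product([False, True], repeat=len(atoms)))

def subsumed_by(c, d, atoms):
    """C is subsumed by D iff the concept (C and not D) is unsatisfiable."""
    return not satisfiable(lambda v: c(v) and not d(v), atoms)

ATOMS = ["p", "q"]
p_and_q = lambda v: v["p"] and v["q"]
just_p = lambda v: v["p"]
```

Here `subsumed_by(p_and_q, just_p, ATOMS)` holds, since nothing can satisfy both `p and q` and `not p`, while the converse test fails.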

3.1 Example 

A simple example illustrates the vital importance of optimisation techniques with the 
kinds of basic concept descriptors that will be generated using the representation dis- 
cussed in section 2. 

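We will take a simple prototypical concept description consisting of only four
“features” $\exists f_1.C_1 \sqcap \exists f_2.C_2 \sqcap \exists f_3.C_3 \sqcap \exists f_4.C_4$, where each of the $C_i$ is a conjunction of
three primitives such as $P_{i1} \sqcap P_{i2} \sqcap P_{i3}$, and generate a basic descriptor $C_V$ that will
subsume any conjunction containing at least two of the $\exists f_i.C_i$ terms:

$$\begin{aligned}
C_V = {} & (\exists f_1.C_1 \sqcap \exists f_2.C_2) \sqcup (\exists f_1.C_1 \sqcap \exists f_3.C_3) \sqcup {} \\
& (\exists f_1.C_1 \sqcap \exists f_4.C_4) \sqcup (\exists f_2.C_2 \sqcap \exists f_3.C_3) \sqcup {} \\
& (\exists f_2.C_2 \sqcap \exists f_4.C_4) \sqcup (\exists f_3.C_3 \sqcap \exists f_4.C_4)
\end{aligned}$$

When classifying a concept $D \doteq \exists f_1.C_1 \sqcap \exists f_2.C_2$, it will be necessary to determine
if $C_V$ subsumes D. As described above, this will be transformed into a satisfiability test:
$C_V$ subsumes D iff $D \sqcap \neg C_V$ is not satisfiable. As a result of its being negated, the $C_V$
part of this description becomes a conjunction of disjunctive clauses:

$$\begin{aligned}
& (\exists f_1.C_1 \sqcap \exists f_2.C_2) \sqcap {} \\
& (\forall f_1.\neg C_1 \sqcup \forall f_2.\neg C_2) \sqcap (\forall f_1.\neg C_1 \sqcup \forall f_3.\neg C_3) \sqcap {} \\
& (\forall f_1.\neg C_1 \sqcup \forall f_4.\neg C_4) \sqcap (\forall f_2.\neg C_2 \sqcup \forall f_3.\neg C_3) \sqcap {} \\
& (\forall f_2.\neg C_2 \sqcup \forall f_4.\neg C_4) \sqcap (\forall f_3.\neg C_3 \sqcup \forall f_4.\neg C_4)
\end{aligned}$$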

To test the satisfiability of this concept, a naive tableau algorithm would try to build
an example interpretation by proceeding roughly as follows:

1. Initialise the interpretation to contain a single individual $x_0$ which satisfies the
concept. Expand all of the conjunctions, making it explicit that $x_0$ satisfies each of
$\exists f_1.C_1, \ldots, (\forall f_3.\neg C_3 \sqcup \forall f_4.\neg C_4)$.

2. Search for a consistent expansion of the disjunctive concepts. Expand each
unexpanded disjunction by selecting one of the disjuncts, backtracking and trying the
other disjunct if that fails (leads to a contradiction). Typically, $\forall f_1.\neg C_1$ would be
chosen from the first disjunction, $\forall f_2.\neg C_2$ from the fourth disjunction (disjunctions
2 and 3 are satisfied by the first choice), and $\forall f_3.\neg C_3$ from the last disjunction.⁴

3. Expand the $\exists f_i.C_i$ terms one at a time. For $\exists f_1.C_1$, this means creating a new
individual $x_1$ satisfying the concept $C_1$ and related to $x_0$ by the role $f_1$. Due to the
$\forall f_1.\neg C_1$ chosen from the first disjunction, $x_1$ must also satisfy $\neg C_1$. This seems to
be an obvious contradiction, but as $C_1$ is actually the conjunction $P_{11} \sqcap P_{12} \sqcap P_{13}$,
and $\neg C_1$ is the disjunction $\neg P_{11} \sqcup \neg P_{12} \sqcup \neg P_{13}$, discovering the contradiction
in $x_1$ will mean expanding the conjunction and then searching the terms in the
disjunction to discover that each choice leads to a contradiction with one of the
expanded conjuncts.

4. Having discovered this contradiction, the algorithm will backtrack and continue
searching different expansions of the conjunctions which $x_0$ must satisfy until it
discovers that all possibilities lead to contradictions. It is then possible to conclude
that $D \sqcap \neg C_V$ is not satisfiable, and that $C_V$ thus subsumes D.

⁴ Completing all propositional reasoning before expanding $\exists R.C$ terms minimises space
requirements [12].

There are several obvious inefficiencies in this procedure, and some not so obvious.
In the first place, there is the problem of the late discovery of “obvious”
contradictions, for example when a complete (non-deterministic) expansion of $C_1$ and $\neg C_1$ is
performed in order to discover the contradiction in $x_1$. This is a consequence of the
fact that most tableaux algorithms assume the input concept to be fully unfolded (all
defined concepts are substituted with their definitions), and in negation normal form
(NNF), with negations applying only to primitive concepts [12]. Arbitrary $\mathcal{ALC}$
concepts can be converted to NNF by internalising negations using DeMorgan’s laws and
the identities $\neg \exists R.C = \forall R.\neg C$ and $\neg \forall R.C = \exists R.\neg C$.
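
The NNF conversion just described can be sketched as a small recursive function. The tuple representation of concepts used here (`"and"`, `"or"`, `"not"`, `"some"` for the existential restriction, `"all"` for the universal restriction, and plain strings for primitive concepts) is an assumption made for illustration and is not the internal format of any of the systems discussed.

```python
def nnf(concept):
    """Push negations inwards until they apply only to primitive concepts."""
    if isinstance(concept, str):
        return concept                        # primitive concept
    op = concept[0]
    if op in ("and", "or"):
        return (op,) + tuple(nnf(c) for c in concept[1:])
    if op in ("some", "all"):
        return (op, concept[1], nnf(concept[2]))
    if op != "not":
        raise ValueError(f"unknown constructor: {op!r}")
    inner = concept[1]
    if isinstance(inner, str):
        return concept                        # negated primitive is already NNF
    inner_op = inner[0]
    if inner_op == "not":                     # double negation
        return nnf(inner[1])
    if inner_op == "and":                     # De Morgan
        return ("or",) + tuple(nnf(("not", c)) for c in inner[1:])
    if inner_op == "or":                      # De Morgan
        return ("and",) + tuple(nnf(("not", c)) for c in inner[1:])
    if inner_op == "some":                    # not (some R.C) = all R.(not C)
        return ("all", inner[1], nnf(("not", inner[2])))
    if inner_op == "all":                     # not (all R.C) = some R.(not C)
        return ("some", inner[1], nnf(("not", inner[2])))
    raise ValueError(f"unknown constructor: {inner_op!r}")

negated = ("not", ("and", ("some", "f1", "C1"), ("some", "f2", "C2")))
print(nnf(negated))
```

Applied to the negation of one disjunct of the basic descriptor, the function returns the corresponding disjunctive clause over universal restrictions, as in the example of Section 3.1.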

The Kris system uses lazy unfolding to deal with the problem of late discovery,
only unfolding and converting to NNF as required by the progress of the algorithm.
Thus if $C_1$ were a named concept (introduced by a concept definition statement of the
form $C_1 \doteq P_{11} \sqcap P_{12} \sqcap P_{13}$), then its unfolding would be postponed and the
contradiction between $C_1$ and $\neg C_1$ immediately discovered. FaCT takes this idea to its
logical conclusion by giving unique system generated names to all compound concepts.
Moreover, the input is lexically analysed to ensure that the same name is given to
lexically equivalent concepts. This means that the concepts $\exists f_1.C_1$ and $\forall f_1.\neg C_1$ would be
named $A$ and $\neg A$ respectively (for some system generated name A), and a contradiction
would be detected without the need to create $x_1$.

Another problem with the naive search is that the same expansion can be explored
more than once. For example, after some backtracking the algorithm will determine
that choosing $\forall f_2.\neg C_2$ from the fourth disjunction always leads to a contradiction and
will try the second choice, $\forall f_3.\neg C_3$. Expanding the fifth disjunction will then lead to
$\forall f_2.\neg C_2$ being chosen, an identical solution to the first one. FaCT avoids this
problem by using a semantic branching search technique adapted from the
Davis-Putnam-Logemann-Loveland procedure (DPL) commonly used to solve propositional
satisfiability (SAT) problems [7, 9]. Semantic branching works by selecting a concept C such that
C is an element of an unexpanded disjunction and $\neg C$ is not already in the solution,
and searching the two possible expansions obtained by adding either C or $\neg C$. Wasted
search is avoided because the two branches of the search tree are strictly disjoint. For
example, when the choice of $\forall f_1.\neg C_1$, $\forall f_2.\neg C_2$ and $\forall f_3.\neg C_3$ leads to a contradiction,
subsequent backtracking will cause the choice of $\forall f_2.\neg C_2$ to be changed to $\neg \forall f_2.\neg C_2$,
so the first solution can never be repeated.
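
Semantic branching can be illustrated on a propositional abstraction of the example. In the sketch below, the integers 1 to 4 are hypothetical names for the four universal restrictions on $f_1$ to $f_4$, and a negative integer is the negation of the same name, so that the existential restriction on $f_i$ is written `-i` in the spirit of the lexical naming described above. This is a minimal DPLL-style procedure written for illustration, not FaCT's implementation, and it ignores roles entirely.

```python
def simplify(clauses, literal):
    """Assume literal is true: drop satisfied clauses, shrink the others."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue
        result.append(clause - {-literal})
    return result

def dpll_satisfiable(clauses):
    """Decide satisfiability of a list of clauses (frozensets of ints)."""
    if not clauses:
        return True                       # every clause is satisfied
    if any(len(clause) == 0 for clause in clauses):
        return False                      # empty clause: contradiction
    for clause in clauses:                # unit clauses force their literal
        if len(clause) == 1:
            (literal,) = clause
            return dpll_satisfiable(simplify(clauses, literal))
    # Semantic branching: try the chosen literal true, then false, so that
    # the two branches of the search are strictly disjoint.
    literal = next(iter(clauses[0]))
    return (dpll_satisfiable(simplify(clauses, literal))
            or dpll_satisfiable(simplify(clauses, -literal)))

# The six disjunctive clauses of the negated basic descriptor.
NEGATED_CV = [frozenset(pair) for pair in
              [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]]
# The two existential restrictions of D, as negations of names 1 and 2.
D = [frozenset([-1]), frozenset([-2])]
```

With this encoding `dpll_satisfiable(D + NEGATED_CV)` is false: the two unit clauses from D empty the first disjunctive clause without any search, which corresponds to the conclusion that the basic descriptor subsumes D.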

Finally, after the discovery of the contradiction in $x_1$, the naive search continues
with chronological backtracking in spite of the fact that the contradiction was caused
by $\forall f_1.\neg C_1$, the first choice made. FaCT deals with this problem by using backjumping,
a form of dependency directed backtracking adapted from constraint satisfiability
problem solving [4]. Each concept is labelled with a dependency set indicating the branching
choices on which it depends, and when a contradiction is discovered the algorithm can
jump back over intervening choice points without exploring alternative choices.

4 Empirical Investigations

An empirical evaluation was performed in order to determine the viability of using a 
real knowledge base developed using the representational model described in section 2. 
This evaluation used synthetically generated data in order to evaluate the performance 
of FaCT and to determine if the optimisation techniques described in Section 3.1 would 
be sufficiently powerful to permit empirically tractable reasoning with respect to the 
kinds of subsumption problem that would be encountered. The tests were also run using 
Kris in order to identify levels which have previously caused problems, and as a way 
of identifying cases where FaCT may involve extra cost. 

The testing used a variation of a random concept generation technique first
described by [9] and subsequently refined by [15]. The generated concepts are of the
form $\exists f_1.C_1 \sqcap \ldots \sqcap \exists f_\ell.C_\ell$, where each $f_i$ is
an attribute (single valued role) and each $C_i$ is a conjunction of n primitive concepts
chosen from N possibilities.

For a given concept C and an approximation value V in the range 0-100, a concept
$C_V$ is formed, as in Section 3.1, consisting of a disjunction of all possible conjunctions
containing V% of the $\exists f_i.C_i$ terms in C.⁵ To represent the (hardest) kind of
subsumption test that would be involved in the recognition process, a second concept $C_r$ is
formed from C by changing elements of the $C_i$ from each $\exists f_i.C_i$ term so that $C_r$ is
subsumed by $C_V$ with a probability P, and the time taken to test if $C_V$ does in fact
subsume $C_r$ is measured. Varying $\ell$ (the number of “features”) and V gives disjunctions of
varying size, and varying P allows performance to be measured for tests ranging from
“obvious” subsumption to “obvious” non-subsumption.

Initial explorations indicate that for a variety of applications the number of default 
features is likely to be in the range of 10-15, while the percentage match required is 
likely to be about 70%. Tests were performed for the 9 sets of values given in Table 1, 
with $n = 4$ and $N = 6$ in all cases. For each test, $P$ was varied from 0 to 1 in steps of 0.05,
with 10 randomly generated subsumption problems being solved at each data point,
giving a total of 210 subsumption problems in each test. All the tests were performed
on 300MHz Pentium machines, with Allegro CL 5.0 running under Linux, and in order
to keep the CPU time required within reasonable limits a maximum of 1,000s was 
allowed for each problem. 



Table 1. Parametric values for tests

Test      T1  T2  T3  T4  T5  T6  T7  T8  T9
$\ell$     5   5   5  10  10  10  15  15  15
$V$ (%)   90  70  50  90  70  50  90  70  50
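
The size of the disjunction tested in each case follows directly from the construction described above: choosing $V$% of $\ell$ terms (rounded down) in every possible way. The following Python sketch computes the number and size of the conjuncts for the nine parameter sets; the function and variable names are illustrative assumptions, not part of the original experiments.

```python
from math import comb

# (number of features, approximation value V in %) for each test in Table 1
TESTS = {
    "T1": (5, 90), "T2": (5, 70), "T3": (5, 50),
    "T4": (10, 90), "T5": (10, 70), "T6": (10, 50),
    "T7": (15, 90), "T8": (15, 70), "T9": (15, 50),
}


def disjunction_shape(num_features, v_percent):
    """Return (number of conjuncts, size of each conjunct) of the disjunction."""
    k = (num_features * v_percent) // 100  # V% of the terms, rounded down
    return comb(num_features, k), k


for name, (num_features, v_percent) in TESTS.items():
    count, size = disjunction_shape(num_features, v_percent)
    print(f"{name}: {count} conjuncts of size {size}")
```

The values for T3, T8 and T9 agree with the sizes quoted in the discussion of the results (10, 3,003 and 6,435 conjuncts respectively).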



Tests T1-T3 proved relatively easy for both FaCT and Kris, with both systems
able to solve any of the problems in less than 0.1s of CPU time. This is not particularly
surprising as, even for T3, $C_V$ will be a disjunction of only 10 conjuncts, each of which
is of size 2. Tests T4 and T7 also proved relatively easy, with both systems able to solve
any problem in less than 0.3s of CPU time. This is again due to the small size of the
disjunctions, resulting in this case from the 90% approximation value.

^ The number of terms is rounded down to the nearest integer.

Fig. 1. Percentile times for T9 with Kris and FaCT

For tests T5 and T6 the difference between FaCT and Kris became more evident. 
For T5, FaCT is able to solve >90% of problems in less than 0.3s, while for T6 this 
increases to 0.4s. With Kris, the time taken to solve a problem critically depends on
whether $C_V$ subsumes $C_r$ (i.e., $C_r \sqcap \neg C_V$ is unsatisfiable) or not. For T5 most
non-subsuming (satisfiable) problems are solved in less than 0.1s whereas subsuming problems
take more than 3.5s, while for T6 these values are 0.1s and 21s respectively. Kris’s
faster time for non-subsuming problems is due to the fact that, in most cases, a solution 
can quickly be found regardless of the search strategy; FaCT, on the other hand, still 
has the overhead of its more sophisticated search techniques, and in particular of the 
lexical analysis and naming of sub-concepts. 

For tests T8 and T9, Kris’s difficulty with subsuming problems becomes critical
and it proved unable to solve any such problem within the 1,000s of CPU time allowed.
FaCT remained consistent with respect to both subsuming and non-subsuming problems,
solving >90% of problems in less than 9s for T8 and less than 28s for T9, with
FaCT’s worst time in all tests being 31s. Figure 1 shows the 50th percentile (median)
and 90th percentile^ times for T9 with Kris and FaCT plotted against the probability
of generating subsuming concepts. Note that where the CPU time is shown as 1,000s
no solution was found, and the time which would be required in order to find a solution
could be $\gg$ 1,000s.

Kris’s poor performance is easily explained by the fact that for T8, $C_V$ will be a
disjunction of 3,003 conjuncts, each of which is of size 10. When $C_V$ is negated in the
subsumption test this becomes a conjunction of disjuncts which, using a naive strategy,
leads to a search of $10^{3,003}$ possible expansions (although only $2^{15}$ of these can be
unique); for T9 $C_V$ will be a disjunction of 6,435 conjuncts, each of which is of size 7.

^ The 90th percentile is the maximum time taken to solve all but the hardest 10% of problems.

5 Discussion and Conclusions 

Clearly the results using FaCT on the larger disjuncts (in tests T8 and T9) are encourag- 
ing compared to Kris, indicating that frequent use of optimised disjunctive reasoning 
is potentially viable. To ascertain whether the very significant gains are sufficient to 
justify the proposed representation in real applications, some further questions should 
be considered: At what rate do individuals need to be categorised? Will one instance 
inference, or recognition process, lead to further instance inferences? How many sub- 
sumption tests are needed for an instance recognition? How likely is it that the more 
difficult subsumption tests will occur?

An additional question concerns space complexity. A naive encoding of the conceptual
representation described results in an exponential increase in space
requirements. However, we would expect to adapt existing techniques which only require
keeping part of the concept hierarchy in memory, and to expand concepts to their
full representation only at run-time. The exponential space increase will not result in an
exponential time increase using the described optimisations, because most of
the increase is in equivalent concepts which are pruned away.

The rate at which instance inference needs to be done can vary widely depending on 
the application. In a real-time telecommunications fault diagnostic system, individual 
descriptors needing to be classified as normal, or as a particular category of fault, may 
arrive at several per second. On the other hand a support system for medical diagno- 
sis, being used by an individual doctor, could reasonably expect a descriptor of patient 
symptoms every 10 minutes. The rate for a bibliographic or travel KB, responding to 
user queries probably lies somewhere between these two. The experimental response 
times we have established are clearly adequate for some applications, but possibly in- 
adequate for others.^ 

Applications with highly interrelated individuals can result in significant propaga- 
tion when a single individual is modified. Consequently one recognition process can 
trigger several other such processes. Some applications (such as a bibliographic data- 
base or a travel information database) rely on a large set of individuals many of which 
may be interrelated. However, other applications (such as the medical diagnostic sup- 
port described in [18], where individuals are descriptions of a set of patient symptoms) 
mostly deal with individuals which have no effect on other individuals and thus can 
only result in the subsumption tests necessary for a single recognition problem. 

The number of subsumption tests required for a particular instance recognition task 
depends on both the number of concepts and the form of the hierarchy. Assuming that 
the hierarchy is close in form to a tree, and that individuals typically belong to only one 
sub-class at each level (at least until the bottom levels of the hierarchy are reached), then 
the number of subsumption tests needed at each level will be equal to the fan-out of the
hierarchy at that level. Consequently, the total number of subsumption tests required
will be roughly the average fan-out multiplied by the depth of the tree. Moreover, FaCT
uses a caching optimisation to facilitate the quick discovery of non-subsumption, and
this will typically work for all but one test at each level [13]. This effectively reduces
the number of “full” subsumption tests to be equal to the depth of the tree.

^ Although if inadequate response times occur relatively infrequently it may be possible to
achieve usability by supplementing the optimisation techniques with special-purpose heuristics.

The form of the hierarchy generated using the representation described in Section 2, 
with 3 descriptors per concept, obviously increases the number of nodes in the hierar- 
chy by a factor of 3. It is also possible that the form of the hierarchy differs from 
concept hierarchies with which we are familiar, due to the various nuances of A is-a B 
which become available. For example in Figure 2 the hierarchy on the left represents 
the case where As are typically Bs, whereas the hierarchy on the right represents the 
case where As are always Bs. Further work is needed to determine the form of appli- 
cation taxonomies using this representation, but it is unlikely that the number of hard 
subsumption tests required per recognition task will change significantly: only the ba- 
sic descriptors are highly disjunctive, and the caching optimisation should still allow 
“full” tests to be avoided in most cases. It is also likely that further optimisations can 
be developed, based on the particular representations we are using.

Fig. 2. Two nuances of A is-a B with 3 descriptors



The experimental subsumption problems generated were deliberately designed to 
be difficult, and it is unclear how often such problems would be encountered in a KB 
using the representation proposed (it is likely they would be more common than is usual 
for difficult subsumption problems in KBs not routinely using this representation). The 
best case would be that such difficult subsumption tests would be encountered only 
very occasionally, and never more than one per individual recognition process. Given 
that ontologies tend to be much broader than they are deep, typically with a depth in 
the range of 7 to 14, this would give (for the T9 situation) a response time which occa- 
sionally peaked at around 30s; the worst case would be that all “full” subsumptions for 
a given individual were difficult, giving a response time of 7 minutes for a typical hierarchy
of depth 14. This may still be acceptable for an application such as that described
in [18] where the system is being used as a diagnostic support tool for medicine. 
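
The arithmetic behind these estimates can be written out as a short Python sketch. The function names and the example fan-out of 5 are assumptions for illustration; the depth of 14 and the figure of roughly 30s per hard test are taken from the discussion above.

```python
def total_tests(avg_fan_out, depth):
    """Rough number of subsumption tests for one recognition task in a tree-like hierarchy."""
    return avg_fan_out * depth


def worst_case_response(depth, seconds_per_hard_test):
    """Response time if every 'full' subsumption test (one per level) is hard."""
    return depth * seconds_per_hard_test


tests = total_tests(avg_fan_out=5, depth=14)  # fan-out of 5 is a made-up example
seconds = worst_case_response(depth=14, seconds_per_hard_test=30)
print(tests, seconds, seconds / 60)
```

With caching, only one test per level is expected to be a "full" test, which is why the worst case scales with the depth rather than with the total number of tests.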

To sum up, even making very pessimistic assumptions leads to a predicted worst- 
case response time of 7 minutes per recognition process. This is clearly within the range 
of useful response times for some applications. As a result of these explorations we are 
convinced that the recent optimisations make routine disjunctive reasoning feasible and 
thus justify using a representational approach based on disjunction. We are in the
process of building a bibliographic KB application to further explore the representation 
of concepts as described and the associated computational properties.

Abstract. Searching for solutions to constraint satisfaction problems
(CSPs) is NP-hard in general. Heuristics for variable and value ordering
have proven useful in guiding the search towards more fruitful areas of the
search space and hence reducing the amount of time spent searching for
solutions. Static ordering methods impart an ordering in advance of the
search and dynamic ordering methods use information about the state of
the search to order values or variables during the search. A well-known
static value ordering heuristic guides the search by ordering values based
on an estimate of the number of solutions to the problem. This paper
compares the performance of several such heuristics and shows that they
do not give a significant improvement over a random ordering for hard
CSPs. We give a dynamic ordering heuristic which decomposes the CSP
into spanning trees and uses Bayesian networks to compute probabilistic
approximations based on the current search state. Our empirical results
show that this dynamic value ordering heuristic is an improvement for
sparsely constrained CSPs and detects insoluble problem instances with
fewer backtracks in many cases. However, as the problem density increases,
our results show that the dynamic method and static methods
do not significantly improve search performance.



1 Introduction 

Constraint satisfaction problems [16] consist of a set of variables, a domain of 
values for each variable and a set of constraints between variables representing 
mutually permissive value assignments. A CSP admits a solution when there 
exists a value for each variable such that all the constraints are satisfied. Many 
problems in artificial intelligence can be modeled as CSPs. Some examples in- 
clude scheduling, machine vision, planning, image recognition, scene analysis and 
configuration. The constraints in a CSP can be over one or more variables; however,
without loss of generality, we restrict our study to CSPs with binary constraints,
where every constraint is between two variables [10].

Constructive search algorithms such as backtracking [5] maintain a set of
assigned variables which form a partial solution satisfying all of the constraints.
At each iteration, a variable which is not currently assigned is chosen and a value
from its domain assigned. If no values in the domain are consistent with the
partial assignment, the search backtracks and the previous assignment is redone
with a different value. The algorithm proceeds
in this manner until a solution is found or all possible assignments have been
visited unsuccessfully. In general, searching for solutions to CSPs is NP-hard,
so it is of great interest to explore heuristics that guide the search towards
areas of the search space that are likely to contain solutions. Such guides can be
incorporated into a backtracking algorithm by advising on the next variable to
instantiate or by ordering the values in a variable’s domain. The order in which
the variables and their values are considered can be decided prior to searching
(static ordering) or during the search (dynamic ordering). This paper examines
the utility of both static and dynamic value-ordering heuristics in search.
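
As a point of reference, the following Python sketch shows where a value-ordering heuristic plugs into chronological backtracking. It is a minimal illustration under stated assumptions, not the implementation used in the experiments: the representation of constraints as sets of allowed value pairs, the fixed variable order and the `order_values` hook are all choices made here for brevity.

```python
def consistent(var, val, assignment, constraints):
    """Check var=val against every already-assigned variable.

    constraints maps an ordered pair (a, b) to the set of allowed (value_a, value_b).
    """
    for other, other_val in assignment.items():
        if (var, other) in constraints and (val, other_val) not in constraints[(var, other)]:
            return False
        if (other, var) in constraints and (other_val, val) not in constraints[(other, var)]:
            return False
    return True


def backtrack(variables, constraints, order_values, assignment=None):
    """Chronological backtracking; order_values(var, assignment) is the heuristic hook."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(variables):
        return dict(assignment)
    var = next(v for v in variables if v not in assignment)  # static variable order
    for val in order_values(var, assignment):
        if consistent(var, val, assignment, constraints):
            assignment[var] = val
            solution = backtrack(variables, constraints, order_values, assignment)
            if solution is not None:
                return solution
            del assignment[var]  # undo and try the next value
    return None  # no value works: the caller revises its own assignment


# Toy problem: three variables that must all take different values.
domains = {"A": [0, 1], "B": [0, 1, 2], "C": [0, 1, 2]}
neq = {(x, y) for x in range(3) for y in range(3) if x != y}
constraints = {("A", "B"): neq, ("B", "C"): neq, ("A", "C"): neq}
solution = backtrack(["A", "B", "C"], constraints,
                     order_values=lambda var, assignment: domains[var])
print(solution)
```

A static heuristic corresponds to an `order_values` that ignores `assignment`; a dynamic heuristic recomputes the order from the current partial assignment.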

Estimates of the number of solutions to a CSP, originally introduced in [6], 
can be used as value-ordering heuristics by advising on the next move in the 
search. The idea of using an estimate of the number of solutions in the subtree
rooted at a particular instantiation was later applied as a static value-ordering 
heuristic by Dechter and Pearl [2]. Their estimation method reduces the com- 
plexity of a CSP by removing constraints until a spanning tree remains and 
values are then ordered according to the estimated number of solutions. Meisels 
et al. [7] describe another method, based on probabilistic updating in Bayesian 
networks, which was shown to be more accurate on average than the spanning 
tree method. Their estimates were used to derive global solution probabilities 
for each variable-value instantiation. Each variable’s domain is ordered based on 
an approximation to the probability that the instantiation is part of a global 
solution. 

We introduce a new static value-ordering heuristic, called the Multiple Span- 
ning Tree method (MST), which approximates the probability that a CSP is 
satisfiable. The multiple spanning tree method preserves the constraints in the 
network by reducing the complete CSP to a disjoint set of decomposed subprob- 
lems. We then show how it can be extended by incorporating Bayesian networks 
to give a dynamic value-ordering heuristic. 

The remainder of this paper is organized as follows: Section 2 gives a descrip- 
tion of existing probabilistic static value-ordering methods. The first, called the 
Single Spanning Tree method (SST), is based on a decomposition of the CSP into 
a representative spanning tree. The second method is the Uniform Propagation 
method which is an improvement on SST and gives more accurate approximations
on average. In Section 3 we introduce the multiple spanning tree method
(MST) which is an extension to Dechter and Pearl’s SST method. However, in- 
stead of maintaining a single spanning tree from the original CSP, our method 
preserves the complete CSP in the form of a disjoint set of spanning trees. The 
MST method can be used to order values in advance of the search (static MST) 
and we describe a dynamic value-ordering heuristic (dynamic MST) which com- 
bines the multiple spanning tree method with Bayesian networks to approximate 
solution probabilities for branching choices based on the current configuration 
of the search. Section 4 compares the search performance of three static value- 
ordering heuristics and our dynamic one. We compare the methods on problems 
with varying degrees of difficulty and measure their performance in terms of
the number of consistency checks in search. Our empirical results show that the
static value-ordering methods are of marginal utility as they tend to perform as 
poorly as a random ordering as problem hardness increases. The dynamic order- 
ing detects insoluble problems with fewer backtracks and gives an improvement 
on hard CSPs that are sparsely constrained but the performance degrades as 
problems become more dense. 



2 Previous Work 

A binary CSP is defined by a set of $n$ variables $\{X_1, \ldots, X_n\}$, each having a
domain $D_j$ of $m$ possible values $\{v_1, \ldots, v_m\}$. $C$ is a set of binary constraints
between variables where each constraint is denoted as a set $C_{jk}$ of consistent
value pairs for constrained variables $X_j$ and $X_k$ respectively. A CSP is satisfiable
if there exists a set of values $\{v_1, \ldots, v_n\}$ corresponding to variables $\{X_1, \ldots, X_n\}$
such that $(v_j, v_k) \in C_{jk}$ for all constrained variables $(X_j, X_k)$. The number of
solutions to a CSP consists of the total number of such unique value sets.

Dechter and Pearl [2] introduce a single spanning tree approximation heuristic
(SST) which estimates the number of solutions to a CSP and uses it as an
advising technique for value ordering. Their method relaxes the constraint network
of the CSP by extracting a spanning tree of the tightest constraints and
computes the exact number of solutions for the spanning tree as an estimate of
the total number of solutions to the complete CSP. The algorithm orders values
for a variable $X_j$ by considering the subproblem rooted at $X_j$ (denoted $G_j$).
Each arc in $G_j$ is assigned a weight representing the number of compatible
value pairs for that binary constraint. The constraints are then relaxed to form
a maximum-weighted spanning tree which is used as an approximation to $G_j$.
Once the spanning tree has been established, the algorithm computes the number
of solutions for each of the possible value assignments for the current variable.
An estimate of the number of solutions with variable $X_j$ (the root node of the
spanning tree) instantiated to value $v_t$ is computed recursively as:

    $N(X_j = v_t) = \prod_{c \in C'} \sum_{v_i \in D'} N(X_c = v_i)$

where

    $C' = \{c \mid X_c \text{ is a child of } X_j\}$

    $D' = \{v_i \in D_c \mid (v_t, v_i) \in C_{jc}\}$

The algorithm computes the number of solutions starting from the leaves 
and working towards the root. The spanning trees for each level in the complete 
search tree are computed prior to searching and thus value-orderings for each 
variable are determined statically. 
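
The recursion can be sketched in a few lines of Python. The data layout (a `children` map for the rooted tree and an `allowed` map of compatible value pairs) and the toy constraints are assumptions made for this illustration; only the product-of-sums recursion itself comes from the formula above.

```python
def count_solutions(var, val, children, domains, allowed):
    """N(X_var = val) on a rooted spanning tree.

    children[var]      -> list of child variables of var in the tree
    domains[var]       -> iterable of values
    allowed[(var, c)]  -> set of compatible (value_of_var, value_of_c) pairs
    A leaf has an empty product, so it counts as 1.
    """
    total = 1
    for child in children.get(var, []):
        total *= sum(
            count_solutions(child, cval, children, domains, allowed)
            for cval in domains[child]
            if (val, cval) in allowed[(var, child)]
        )
    return total


# Toy tree: X1 is the root with children X2 and X3; X3 has the child X4.
domains = {v: [0, 1, 2] for v in ("X1", "X2", "X3", "X4")}
children = {"X1": ["X2", "X3"], "X3": ["X4"]}
less = {(a, b) for a in range(3) for b in range(3) if a < b}
neq = {(a, b) for a in range(3) for b in range(3) if a != b}
allowed = {("X1", "X2"): less, ("X1", "X3"): neq, ("X3", "X4"): neq}

estimates = {v: count_solutions("X1", v, children, domains, allowed)
             for v in domains["X1"]}
ordering = sorted(domains["X1"], key=lambda v: estimates[v], reverse=True)
print(estimates, ordering)
```

Values are then tried in decreasing order of their estimated solution counts, so a value with an estimate of zero is tried last.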

The decomposition method of Dechter and Pearl has several identifiable
shortcomings that affect the accuracy of the approximations. One problem is
that when the constraints of a given CSP are equally tight, SST has no
criterion for selecting constraints for the decomposed CSP and thus the representative
spanning tree is chosen arbitrarily. Another problem with SST is that
tightly constrained problems with many variables may suffer from a lack of accuracy
in their approximation. For example, consider a CSP having constraints
of equal tightness between each pair of variables. Such a CSP has $\frac{n(n-1)}{2}$ constraints.
A spanning tree consists of $n - 1$ arcs, thus Dechter and Pearl’s decomposition
method only preserves $\frac{2}{n}$ of the constraints from the original CSP.

Meisels et al. [7] describe a method called Uniform Propagation which ap- 
proximates the number of solutions to a CSP without relaxing the constraint 
network to form a spanning tree. Their approximation method formulates a 
CSP as a Bayesian network and uses an algorithm based on probability updat- 
ing to estimate the number of solutions to the CSP. It approximates for each 
variable-value assignment the probability of being part of a global solution to 
the CSP. This is represented as $P(X_i = v_j)$ for a variable $X_i$ and a value $v_j$. In
effect, this is an approximation of the ratio of the number of complete solutions 
that include these particular assignments. 

The constraint network for a CSP is converted into a directed acyclic graph
(DAG) where an edge $(X_j, X_i)$ between constrained variables $X_j$ and $X_i$ is defined
when $i < j$. Thus, $X_j$ is a predecessor of $X_i$ and $X_1$ is the designated sink
node which has no successors. The marginal probabilities for each variable represent
the corresponding success probabilities for the domain values conditioned
on all of the predecessor nodes in the network. To compute the probabilities for
an arbitrary variable $X_i$, the network must be organized such that $X_i$ is the sink
node (i.e. $X_i = X_1$).

Starting at a node in the network (called the root node) and considering
every node towards the designated sink node $X_1$, the marginal probabilities for
each node in the network are conditioned on its predecessors. Predecessors of a
node $X_i$ are nodes connected through a constraint and whose marginal probabilities
have already been computed. To ensure that the marginal probabilities
for each node will be considered either directly or indirectly at the sink node,
vacuous constraints, which permit any pair of values between $X_j$ and $X_1$, are
added to the network. Vacuous constraints are added between the sink and every
other node in the network that does not have any successors.

Once the constraint network of the CSP has been converted into a DAG with
an appropriate ordering, the approximation algorithm is applied to determine
the probabilities of the designated sink node $X_1$. The marginal probability of
a variable instantiation $X_j = v_t$ being part of a global solution, where $\pi(X_j)$
denotes the predecessors of node $X_j$, is denoted $P(X_j = v_t)$ and computed recursively as:

    $P(X_j = v_t) = \frac{1}{m} \prod_{\{c \mid X_c \in \pi(X_j)\}} \sum_{\{v_i \in D_c \mid (v_t, v_i) \in C_{jc}\}} P(X_c = v_i)$

This algorithm assumes conditional independence between predecessors of 
node $X_j$. This assumption temporarily removes edges causing cycles between $X_j$
and its immediate predecessors and allows the probabilistic approximations to
be computed in polynomial time. The effect of this independence assumption 
is that a margin of error is introduced into the probabilities. The probabilistic 
approximation method is used as a static value-ordering heuristic by imparting 
an ordering on the domain of each variable. The probabilities for each variable- 
value combination are computed in advance of the search by making each variable 
in turn the sink node and running the uniform propagation algorithm. This 
ordering heuristic was found to perform slightly better than the SST method on 
an experimental set of problems (as discussed later). 



3 Multiple Spanning Tree Method 

As previously mentioned, one of the major limitations of the approximation 
method of Dechter and Pearl is that the decomposition is generally not an accu- 
rate representation of the original CSP. It was shown that their method gives es- 
timates that are over-optimistic for non-tree CSPs [7] and performs quite poorly 
in practice [13]. The Uniform Propagation method of Meisels et al. is an im- 
provement. Their method maintains the topology of the constraint network and 
gives more accurate approximations by preserving all the constraints. 

We introduce a new approximation method which decomposes the constraint 
network of a CSP into a set of spanning trees of subproblems. Contrary to the 
decomposition method of Dechter and Pearl, our Multiple Spanning Tree (MST) 
method maintains all of the constraints from the original CSP. Approximations 
are computed for each of the subproblems and are then composed giving ap- 
proximations to the complete CSP. Our approximation algorithm is composed 
of two components: Decomposition and Approximation. 



Decomposition Given a CSP $C$ and its corresponding constraint graph $\mathcal{G}$, our
algorithm first decomposes $\mathcal{G}$ into a set of $N$ spanning trees. At each iteration $i$,
the decomposition algorithm extracts a minimum spanning tree $C_i$ from the
constraint graph $\mathcal{G}$ in the same way as the SST method of Dechter and Pearl,
by choosing first those constraints which are tightest. The constraints chosen
for $C_i$ are then removed from $\mathcal{G}$ before the next iteration. This decomposition
thus preserves all of the constraints of the original network. That is,

    $\mathcal{G} = C_1 \cup C_2 \cup \ldots \cup C_N$    (1)

where

    $C_1 \cap C_2 \cap \ldots \cap C_N = \emptyset$    (2)
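
One way to realise this decomposition is repeated application of Kruskal's algorithm, sketched below in Python. This is an illustration under assumptions rather than the authors' code: tightness is taken to be the number of allowed value pairs (fewer pairs is tighter), ties are broken by insertion order, and later iterations may return spanning forests rather than trees once the remaining graph becomes disconnected.

```python
def decompose(variables, allowed):
    """Split a constraint graph into edge-disjoint spanning forests.

    allowed maps an edge (a, b), with a != b, to its set of compatible value
    pairs. Tightest constraints are taken first (Kruskal with union-find).
    """
    remaining = dict(allowed)
    forests = []
    while remaining:
        parent = {v: v for v in variables}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        forest = {}
        for edge in sorted(remaining, key=lambda e: len(remaining[e])):
            a, b = find(edge[0]), find(edge[1])
            if a != b:  # edge joins two components, so it creates no cycle
                parent[a] = b
                forest[edge] = remaining[edge]
        for edge in forest:
            del remaining[edge]
        forests.append(forest)
    return forests


# Complete constraint graph on four variables with equally tight constraints.
variables = ["A", "B", "C", "D"]
neq = {(x, y) for x in range(3) for y in range(3) if x != y}
allowed = {(a, b): neq for i, a in enumerate(variables) for b in variables[i + 1:]}
forests = decompose(variables, allowed)
print([sorted(f) for f in forests])
```

Every constraint ends up in exactly one subproblem, which is the property expressed by equations 1 and 2.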

Approximation Once the CSP $C$ has been decomposed, we compute the probabilities
for each $C_i$. It was shown by Meisels et al. [7] that the SST and Uniform
Propagation methods compute probabilities exactly and equivalently for trees in
polynomial time. Bayesian networks are also known to admit these exact probabilities
for trees in linear time [11], [8]. We represent each subproblem $C_i$ as
a Bayesian network and compute their exact probabilities. These are then composed
to give an approximation to the exact probabilities of the original CSP. In
computing the probabilistic approximation for $C$ we make the general assumption
that the subproblems $C_1$ through $C_N$ are independent. This assumption is
based on the observation that they do not share constraints and thus can be
regarded as a set of disjoint subproblems of $C$. It is important to note, however,
that the subproblems are disjoint in terms of their constraints only. In terms of
their respective solution sets, they are obviously not independent since a global
solution to $C$ must also be a solution in each $C_i$. For each of the individual subproblems
$C_i$, $P_{C_i}(X = v)$ is the probability that $X = v$ is part of a
solution to $C_i$. For a given variable $X$ and value $v$ in its domain, we thus approximate
the probability $P(X = v)$ that $X = v$ is part of a global solution
to $C$ as:

    $P(X = v) = P_{C_1}(X = v) \cdot \ldots \cdot P_{C_N}(X = v)$    (3)

We now formalize our independence assumption and give a derivation of the
preceding formula 3. We define $S$ as the set of solutions to the CSP $C$ and $P(S)$
as the probability that a solution is achieved given a complete instantiation
of values to variables. $P(S)$ can also be interpreted as the probability that a
complete random instantiation is in the global solution set $S$. $P(S)$ is thus the
total number of solutions ($|S|$) over the size of the configuration space for $C$:

    $P(S) = \frac{|S|}{m^n}$    (4)

Given the solution sets $\{S_1, \ldots, S_N\}$ corresponding to the decomposed subproblems
$\{C_1, \ldots, C_N\}$:

    $S = S_1 \cap S_2 \cap \ldots \cap S_N$    (5)

Essentially, if a configuration to the CSP $C$ is in each $S_i$, it must satisfy all
the constraints of $C$ (by equations 4 and 5) and is thus an element of the global
solution set $S$ for $C$.
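
Both statements can be checked by brute force on a small instance. The Python sketch below enumerates all complete instantiations of a three-variable problem; the problem itself, its split into two subproblems and the helper name `solutions` are assumptions chosen for illustration.

```python
from itertools import product


def solutions(variables, domain, constraints):
    """All complete assignments (as tuples) satisfying every constraint."""
    index = {v: i for i, v in enumerate(variables)}
    return {
        values for values in product(domain, repeat=len(variables))
        if all((values[index[a]], values[index[b]]) in pairs
               for (a, b), pairs in constraints.items())
    }


variables = ["X1", "X2", "X3"]                  # n = 3
domain = [0, 1, 2]                              # m = 3
neq = {(a, b) for a in domain for b in domain if a != b}
sub1 = {("X1", "X2"): neq, ("X2", "X3"): neq}   # a spanning tree of the constraint graph
sub2 = {("X1", "X3"): neq}                      # the remaining constraint
full = {**sub1, **sub2}

S = solutions(variables, domain, full)
S1 = solutions(variables, domain, sub1)
S2 = solutions(variables, domain, sub2)

p_S = len(S) / len(domain) ** len(variables)    # |S| divided by m^n
print(len(S), p_S, S == S1 & S2)
```

The subproblems have 12 and 18 solutions respectively, while only 6 instantiations satisfy both, which also illustrates why the solution sets of the subproblems are not truly independent.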

The second assumption we make in our model is that a global solution to
the CSP does exist and thus $S \neq \emptyset$. This assumption is an artifact of our implementation.
We use Bayesian networks to compute the exact beliefs for the
subproblems and our representation assumes that each constraint in the CSP is
not violated. Bayesian networks must maintain a consistent set of prior probabilities
and thus a violated constraint violates the Bayesian network. It follows
from equation 5 that $S_1, \ldots, S_N \neq \emptyset$. That is, each decomposed subproblem $C_i$
is also assumed to contain a solution. Our approximation $P(X = v)$ is thus
conditioned on $S$ giving:



    $P(X = v \mid S)$    (6)

By the inverse of the product rule we get:

    $\frac{P(X = v \cap S)}{P(S)}$    (7)

We then apply the product rule to the numerator:

    $\frac{P(S \mid X = v) \cdot P(X = v)}{P(S)}$    (8)

and by equation 5:

    $\frac{P(S_1 \cap \ldots \cap S_N \mid X = v) \cdot P(X = v)}{P(S_1 \cap \ldots \cap S_N)}$    (9)

By our independence assumption we get:

    $\frac{P(S_1 \mid X = v) \cdots P(S_N \mid X = v)}{P(S_1) \cdots P(S_N)} \cdot P(X = v)$    (10)

Rearranging this equation using Bayes’ law we get:

    $\frac{P(X = v \mid S_1) \cdots P(X = v \mid S_N)}{P(X = v)^{N-1}}$    (11)

Our approximation algorithm normalizes the approximations over a domain
to sum to 1 since we assume that a solution $S$ exists. Thus the term $P(X = v)^{N-1}$
is essentially eliminated from equation 11 leaving:

    $P(X = v \mid S_1) \cdots P(X = v \mid S_N)$    (12)

The MST method thus computes probabilistic approximations for each sub- 
problem and composes these through an independence assumption to approxi- 
mate the probability for the complete CSP. 

The MST method can be used to compute solution probabilities prior to 
search and thus give a static ordering to values (static MST). The SST and 
Uniform Propagation methods are also static ordering methods, but can only 
be implemented as such. We can, however, use our approximation method as a
dynamic value-ordering heuristic. Previous work has shown that dynamic value-ordering
can improve performance on certain types of CSPs [15], [3].

Our dynamic heuristic (dynamic MST) uses the decomposition and approximation
of static MST but dynamically maintains the current search state as
evidence in the networks of the subproblems. Each instantiated variable in the
CSP is part of the evidence set $e$ and we use Bayesian networks to derive inferences
based on evidence about the current state of the network for each subproblem
$C_i$. Computing exact beliefs in singly-connected Bayesian networks, where
there exists only one path between any pair of nodes, is linear in the number of nodes in the
network [11], [8]. Since each $C_i$ is a tree, belief updating on the subproblems
is a tractable computation. Our algorithm is thus dynamic in the sense that
the approximations are based on a set of evidence ($e$) about the current partial
assignment. For a given variable $X$ and value $v$, we evaluate $P(X = v \mid e)$.

Our approximation heuristic is incorporated into a backtracking algorithm 
to advise on the next most likely move to lead to a solution. At an instantiation 
point, we consider the next value to assign to a variable based on the correspond- 
ing beliefs associated with each value in the variable’s domain. We first select 
that value which has the highest probability or support. Upon choosing a value 
for a variable, we enter that instantiation as evidence in each of the subproblems 
and update their beliefs. 

Our heuristic is inserted into a backtracking algorithm with very little effort. 
We define a procedure Advise(X,e) shown in Figure 1. The Advise(X,e) procedure
is called when considering a new variable to give advice on the search order
over the domain. It takes as arguments the current variable $X$ and the
current instantiation or evidence $e$, and returns a vector of values representing
the search order for the domain of $X$. This procedure first calls the Bayesian
belief algorithm on each subproblem $C_i$ to get the exact probabilities over the
variables given the evidence. Beliefs for the original CSP are then approximated
through the independence equation 3 in Section 3.2. The next step of the
Advise procedure creates the domain vector by adding as elements all values 
having non-zero probabilities and filtering out those known to be non-solutions. 
This step reflects the forward checking and arc-consistency algorithms described 
in the following section. We then call a sorting procedure (Sort(domain,P(X))) 
which arranges the values in the domain vector according to their corresponding 
probabilities and returns the ordered vector of values. 

vector Advise(X,e) {
    vector domain = null;
    for each Ci do
        Update_Beliefs(Ci,e);
    for each value v do
        P(X = v) = 1;
        for each Ci do
            P(X = v) *= P_Ci(X = v);
    for each value v do
        if (P(X = v) > 0) then
            domain.add(v);
    if (domain != null) then
        domain = Sort(domain,P(X));
    return domain;
}

Fig. 1. Advising algorithm for value-ordering 
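
As an illustration of the procedure in Figure 1, the following Python sketch orders the values of one variable by the product of the per-subproblem beliefs. It assumes that the beliefs P_Ci(X = v | e) have already been computed by some belief-updating routine and are passed in as dictionaries. The function name `advise` and this data layout are illustrative assumptions, not the authors' implementation or the Netica API.

```python
from math import prod


def advise(values, subproblem_beliefs):
    """Order the values of one variable by approximate solution probability.

    values: candidate values in the domain of X.
    subproblem_beliefs: one dict per subproblem Ci, mapping each value v
        to the belief P_Ci(X = v | e) (hypothetical input format).
    """
    # Independence approximation: multiply the beliefs across subproblems.
    approx = {v: prod(b[v] for b in subproblem_beliefs) for v in values}
    # Values with zero belief cannot take part in a solution: drop them.
    domain = [v for v in values if approx[v] > 0]
    # Most probable value first; an empty list signals a backtrack.
    return sorted(domain, key=lambda v: approx[v], reverse=True)


beliefs = [
    {"red": 0.5, "green": 0.2, "blue": 0.3},
    {"red": 0.1, "green": 0.0, "blue": 0.9},
]
order = advise(["red", "green", "blue"], beliefs)
print(order)  # ['blue', 'red']
```

Here "green" is filtered out because one subproblem assigns it zero belief, which mirrors the pruning step of the procedure.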



We use a commercial software package called Netica [9] to maintain the prob- 
abilities and compute beliefs for the Bayesian networks. It provides functionality 
through an API for entering and removing evidence for a variable instantiation 
and backtrack respectively. For an instantiation X = v, we enter the finding for
each Ci and for a backtrack, we undo the finding by reverting back to the state
of each Ci before X = v was admitted. An instantiation occurs in the
backtracking algorithm when choosing a value in the domain vector returned from
the Advise(X,e) routine. The two conditions for backtracking are when Advise(X,e)
returns an empty vector or there are no values left in the domain vector to
instantiate.



4 Experiment and Results 

Previous analysis comparing the SST and uniform propagation methods was on 
relatively small experimental problem sizes and did not grade problems based on 
their difficulty [7]. This resulted in limited insight into the performance of each 
method in varying regions of problem hardness. For the experimental analysis 
of each approximation method we use as test cases three categories of problems 
with increasing levels of difficulty. 

The experimental problem set consists of randomly generated CSPs
characterized by four parameters: n, the number of variables; m, the domain size;
p1, the probability of a constraint between a given pair of variables; and p2, the
probability that a pair of values for two variables is inconsistent given that a
constraint exists. We restrict the size of our problems to CSPs with n = 20 and
m = 10 and divide the problems into 3 sets varying in degrees of constrainedness,
with p1 = 0.2, 0.5 and 1.0. The p2 value for each set ranges from 0.01 to 1.0 in
steps of 0.01, representing varying degrees of problem constrainedness. We
generate 20 problems for each p2, giving 2000 CSPs in each of the 3 sets. We are
most interested in examining performance on hard CSPs. These types of
problems can be found in the phase transition region where the solution
probabilities have been found to decrease rapidly from 1 to 0 [1], [14]. The
problems in our experiments consist of both soluble and insoluble problem
instances so that we may generate a cover of this region.
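
The four-parameter model above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `random_csp`, the representation of a constraint as a set of forbidden value pairs, and the use of independent coin flips for p1 and p2 are choices made here, and the original generator may differ in detail.

```python
import random
from itertools import combinations, product


def random_csp(n, m, p1, p2, seed=None):
    """Generate a random binary CSP with n variables and domain size m.

    Returns a dict mapping each constrained pair (x, y), x < y, to the set
    of inconsistent value pairs (a, b) for that constraint.
    """
    rng = random.Random(seed)
    constraints = {}
    for x, y in combinations(range(n), 2):
        # With probability p1 there is a constraint between x and y.
        if rng.random() < p1:
            # Each value pair is inconsistent with probability p2.
            constraints[(x, y)] = {
                (a, b)
                for a, b in product(range(m), repeat=2)
                if rng.random() < p2
            }
    return constraints


# One problem set: p1 fixed, p2 swept from 0.01 to 1.00, 20 problems each.
p2_values = [k / 100 for k in range(1, 101)]
problem_set = [random_csp(20, 10, 0.2, p2, seed=i)
               for p2 in p2_values for i in range(20)]
print(len(problem_set))  # 2000
```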

For our search algorithm we use backtracking with forward checking and 
conflict-directed backjumping (Prosser 1993). To evaluate the heuristics, we con- 
sider the number of backtracks needed to find a solution or prove that no solution 
exists. A backtrack occurs when no value in the current variable’s domain is
consistent with the past instantiations and the assignment for the previous
variable must be undone.

Figures 2, 3 and 4 show the results of our experiments on the three problem
sets. We plot the number of backtracks on a logarithmic scale against the
constraint tightness (p2) for each problem set.

In the easy region where the CSPs are dense with solutions, all the value
ordering heuristics do have a degree of utility. In the easy regions of Figures 2, 3
and 4 (the areas left of the peaks), a random value order results in the largest
number of consistency checks on average. In general, the MST and uniform
propagation methods seem to be the most useful in the easy region while the
SST method is closest to random.

Fig. 2. Average Number of Backtracks for Sparsely Constrained CSPs (p1 = 0.2)




Fig. 3. Average Number of Backtracks for CSPs of Medium Constrainedness
(p1 = 0.5)



There is very little difference between the static value-ordering heuristics,
especially in the hard regions (identified as the peaks in each of the graphs).
It is important to note that almost all of the problems represented in the regions
to the right of the peaks are insoluble. For insoluble problems, static
value-ordering heuristics are of no use since all possible configurations in the
search space must be considered; thus all of the static methods are equal in the
number of backtracks.

350 Matt Vernooy and William S. Havens

Fig. 4. Average Number of Backtracks for Densely Constrained CSPs (p1 = 1.0)

The dynamic method (dynamic MST), however, is capable of detecting
insoluble instances in over-constrained problems with fewer backtracks. The reason
for this is that in each of the decomposed subproblems C1,..,Cn we compute
the exact probabilities.

Instantiations that have a zero probability in a subproblem are interpreted
as variable-value assignments that cannot be part of a global solution given the
current set of instantiations. Instantiations with zero probabilities in the
subproblems are eliminated from the complete CSP (when we compute the
approximation in equation 5) and thus the overall size of the search space is
reduced. In some over-constrained cases, it is possible that one of the decomposed
subproblems Ci is insoluble and we can conclude without searching that no solution
exists to the complete CSP (from equation 5). This saving is seen in the regions
to the right of the peaks in each of the graphs.

However, dynamic MST seems to lose its advantage as the constraint density
of the problems increases. In Figure 2, dynamic MST is noticeably better
than any of the static methods in the hard region and as problems become
over-constrained. However, in Figures 3 and 4, the dynamic method comes
increasingly close to a random ordering. We conjecture that the independence
assumption for our MST methods is stronger for CSPs that are sparsely constrained
and begins to break down as CSPs become more dense and less tree-like. Indeed our
independence assumption holds for trees since our decomposition results in one
subproblem (the original CSP). In this case, our probabilistic approximations are
exact and searching for solutions and non-solutions is backtrack-free. However,
as the problems become more dense and less tree-like, our assumptions begin
to break down and the number of backtracks increases to the point where the
approximations are no longer useful in guiding the search (as seen in Figure 4).

5 Conclusions 

We have given an empirical study of value-ordering in search and our results sug- 
gest that such heuristics do not improve performance on hard sets of problems. 
We have shown that as problem hardness increases, the utility of each
approximation method as value-ordering advice decreases to the point where they
perform as poorly as a random ordering in guiding the search.

We also introduced a dynamic value-ordering heuristic using Bayesian
networks and evidence about the current search configuration to approximate
solution probabilities for CSPs. This method detects insoluble problem instances 
for over-constrained CSPs with fewer backtracks on average than static methods. 
Also, the dynamic method was shown to perform close to an order of magnitude 
better than static heuristics in the hard region of sparsely constrained CSPs. 
However, as the density of CSPs increases, the performance of our dynamic 
heuristic in the hard region becomes close to that of a random ordering.

Abstract. Binary decision diagrams (BDDs) are graph-theoretical,
compact representations of Boolean functions, successfully applied in the
domain of expert systems for practical VLSI design. The authors have
been developing methods of using BDDs for expert systems that
mechanically try to prove the termination of rule-based computer programs.
To make the BDD representation really practical, however, we need good
heuristics for ordering Boolean variables and operations. In this paper,
we present some heuristic methods that could affect the performance
and evaluate them through comprehensive experiments on sample
rule-based programs taken from practical domains such as hardware
diagnosis, software specification, and mathematics. The results show
large differences among the heuristics and provide us with useful information
for optimizing the overall systems.



1 Introduction 

Binary decision diagrams (BDDs) [1,2] are graph-theoretical, compact
representations of Boolean functions, successfully applied in various fields of AI.
In particular, their application in the field of practical VLSI design expert
systems [10] has demonstrated the power of AI to many researchers and engineers
outside the AI community. Much other work has applied BDD technology to AI and
other engineering fields. For example, in [9], BDDs are used as a key technology
for a truth maintenance system, and in [4], BDDs are applied to the reliability
analysis of huge and complex plant systems. This technology also interests the
automated theorem proving community [11].

Education Ministry of Japan; No. 09650444 for the first author and No. 09780231

However, it should be stressed that although this technology is quite general
and may seem easy to apply in various fields, it is known to involve
computationally hard problems, depending on the application domain. One such
problem is to determine an appropriate variable ordering for constructing BDDs.
Without a good variable ordering, the size of the BDDs can grow exponentially.
Unfortunately, no efficient algorithm is known for computing a variable
ordering that minimizes the BDD size because, as is often the case in typical
AI problems, this problem has been proved to be NP-complete. Thus, all we can
do is find good heuristics for solving such a problem. Indeed, such a heuristic
approach to hard problems is what AI is all about.

Recently, the authors have developed a framework for applying BDD
technology in the field of computer software development. More precisely, we have
developed Boolean functions for verifying the correctness (termination, in
particular) of rule-based programs and encoded them as BDDs. However, this does
not by itself imply success in that field. As we mentioned, we need heuristics.
In this paper, we present some heuristics suggested in the literature and evaluate
them through comprehensive experiments on sample rule-based programs taken
from practical domains such as hardware diagnosis, software specification, and
mathematics. The results show large differences among the heuristics and
provide us with useful information for optimizing the overall systems. Such
information has been integrated into Terminator/R, the expert system for
verifying the termination of rule-based programs developed by the authors.

2 Preliminaries 

2.1 Binary Decision Diagrams 

A binary decision diagram (BDD), notation $BDD(F)$, is a directed acyclic graph
representation of a Boolean function $F$. Given an assignment of Boolean values
(0 or 1) to each variable of $F$, we can determine the value of the function by
following the path from the root to a terminal (0 or 1), branching at each $x_i$
node to either the 0- or 1-labelled edge depending on the assigned value for the
variable $x_i$. Usually, we fix a linear order (called a variable ordering) on the
set of Boolean variables, and for every path from the root to a terminal, we let
these variables appear in this ascending order. Figure 1 illustrates
$BDD(x_1 + x_2 x_3)$ with the variable ordering $x_1 < x_2 < x_3$.
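
To make the path-following evaluation concrete, here is a small Python sketch. The BDD for x1 + x2·x3 under the ordering x1 < x2 < x3 is written out by hand; the tuple encoding `(variable, low_child, high_child)` and the function name `evaluate` are assumptions made for this illustration only.

```python
def evaluate(node, assignment):
    """Follow the path selected by `assignment` from `node` to a terminal.

    A nonterminal node is a tuple (variable, low_child, high_child);
    the terminals are the integers 0 and 1.
    """
    while node not in (0, 1):
        var, low, high = node
        # Take the 1-labelled edge if the variable is true, else the 0-edge.
        node = high if assignment[var] else low
    return node


# Hand-built BDD(x1 + x2*x3) with ordering x1 < x2 < x3.
node_x3 = ("x3", 0, 1)
node_x2 = ("x2", 0, node_x3)
root = ("x1", node_x2, 1)

print(evaluate(root, {"x1": 0, "x2": 1, "x3": 1}))  # 1
print(evaluate(root, {"x1": 0, "x2": 1, "x3": 0}))  # 0
```

Along every path the variables appear in ascending order, and no path tests a variable more than once.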

One of the most important aspects of BDDs is their uniqueness: with the
variable ordering fixed, the BDD of a given function is uniquely determined. As
a result, a Boolean function is unsatisfiable if and only if its BDD
representation is $BDD(0)$. Another remarkable aspect of BDDs is their compactness
in size, compared with truth tables and other canonical representations. Actually,
many practically important Boolean functions, particularly in the field of VLSI
design, have been found to be representable by BDDs of moderate size.

The most important practical point in using BDDs is the choice of the
variable ordering and the operation ordering. The variable ordering affects the
size of the resultant BDDs, but the problem of determining the optimal
variable ordering that yields the smallest BDD is known to be NP-complete. The
operation ordering defines the order in which primitive and intermediate BDDs
are combined by AND and OR operations to make the whole structure of the
final BDD. It affects not the size of the resultant BDD but the sizes of
the intermediate BDDs. Even if the final BDD is small, the intermediate BDDs
created in the course of the process can be too large to be handled in practice. To
be practical, therefore, we need good heuristics for ordering Boolean variables
and operations, depending on problem domains.
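
The effect of the variable ordering on BDD size can be seen on a small example. The Python sketch below builds a reduced ordered BDD by exhaustive Shannon expansion with a unique table, which is exponential in the number of variables and suitable only for tiny illustrations. The test function a1·b1 + a2·b2 + a3·b3 is a standard textbook example chosen here, not one of the functions studied in this paper, and the function name `build_robdd` is hypothetical.

```python
from itertools import count


def build_robdd(func, order):
    """Build a reduced ordered BDD for func under the given variable order.

    func takes a dict {variable: bool}. Returns (root_id, unique_table),
    where unique_table maps (var, low_id, high_id) to a node id and the
    ids 0 and 1 denote the terminals.
    """
    unique = {}
    ids = count(2)

    def mk(var, low, high):
        if low == high:              # redundant test: skip the node
            return low
        key = (var, low, high)
        if key not in unique:        # share structurally identical nodes
            unique[key] = next(ids)
        return unique[key]

    def expand(i, assignment):
        if i == len(order):
            return int(bool(func(assignment)))
        var = order[i]
        low = expand(i + 1, {**assignment, var: False})
        high = expand(i + 1, {**assignment, var: True})
        return mk(var, low, high)

    return expand(0, {}), unique


def f(a):
    return (a["a1"] and a["b1"]) or (a["a2"] and a["b2"]) or (a["a3"] and a["b3"])


_, interleaved = build_robdd(f, ["a1", "b1", "a2", "b2", "a3", "b3"])
_, separated = build_robdd(f, ["a1", "a2", "a3", "b1", "b2", "b3"])
print(len(interleaved), len(separated))  # 6 14
```

The same function needs 6 nonterminal vertices under the interleaved ordering and 14 under the separated one, and the gap grows exponentially with the number of pairs.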

2.2 Termination of Rule-Based Programs 

Verification of correctness (including termination) is one of the most
challenging and important applications of AI, because problems of this kind are
often undecidable in general, meaning that there are no general algorithms for
solving them. Thus, the heuristic, knowledge-based approach is essential. In this
paper, we consider the application of BDD technologies to verification of the
termination property of programs written in a language that has simple syntax and
semantics, suitable for the foundational studies of verification.

The language is a rule-based language called rewrite rules [3,12]. In this
language, a program is defined as a set of rules. Each rule is a pair (written in
the form $l \to r$) of terms. The terms may contain constants, function symbols,
and variables as usual. Given an input term, the program rewrites it repeatedly
by using pattern matching until it reaches an irreducible answer term.
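
The rewriting process can be sketched in Python as follows. The encoding is an assumption made for illustration: a variable is a string, any other term is a tuple whose first element is the function symbol (so a constant `a` is `("a",)`), and the helper names `match`, `substitute`, `rewrite_once` and `normalize` are hypothetical. The step limit is only a safeguard for this sketch, since a non-terminating program would otherwise loop forever.

```python
def match(pattern, term, subst):
    """Extend subst so that pattern instantiated by subst equals term."""
    if isinstance(pattern, str):                      # pattern is a variable
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(term, str) or pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst


def substitute(term, subst):
    """Replace the variables of term according to subst."""
    if isinstance(term, str):
        return subst[term]
    return (term[0],) + tuple(substitute(arg, subst) for arg in term[1:])


def rewrite_once(term, rules):
    """Apply one rewrite step, or return None if term is irreducible."""
    for lhs, rhs in rules:
        subst = match(lhs, term, {})
        if subst is not None:
            return substitute(rhs, subst)
    if not isinstance(term, str):
        for i in range(1, len(term)):
            new_arg = rewrite_once(term[i], rules)
            if new_arg is not None:
                return term[:i] + (new_arg,) + term[i + 1:]
    return None


def normalize(term, rules, max_steps=1000):
    """Rewrite repeatedly until an irreducible term is reached."""
    for _ in range(max_steps):
        nxt = rewrite_once(term, rules)
        if nxt is None:
            return term
        term = nxt
    raise RuntimeError("step limit reached; the program may not terminate")


rules = [(("f", ("h", "x")), ("g", "x"))]             # f(h(x)) -> g(x)
print(normalize(("f", ("h", ("a",))), rules))         # ('g', ('a',))
```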

A program is terminating if there are no infinite rewrite sequences for any
input terms. Termination is an undecidable property in general, but some
sufficient conditions for its verification have been studied. In this paper, we
focus on a popular class of such conditions based on precedence. A precedence,
denoted by $\succ$, is a partial ordering on the set of function symbols. It is
extended to a partial ordering $\succ_{lpo}$ (called the lexicographic path
ordering, or LPO) on the set of terms. If all rules $l \to r$ satisfy
$l \succ_{lpo} r$, we may conclude that the program is terminating [7].
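
For readers unfamiliar with the LPO, the following Python sketch checks $s \succ_{lpo} t$ for a given precedence. It follows the standard textbook definition of the lexicographic path ordering, which is an assumption here: the paper defers its own definition to the cited references, and the variant used there may differ in detail. Terms use a hypothetical encoding (a variable is a string, any other term is a tuple headed by its function symbol), and the precedence is given as a set of pairs `(f, g)` meaning f is greater than g, assumed to be transitively closed.

```python
def occurs(var, term):
    """True if the variable var occurs in term."""
    if isinstance(term, str):
        return term == var
    return any(occurs(var, arg) for arg in term[1:])


def lpo_gt(s, t, prec):
    """True if s is greater than t in the LPO induced by prec."""
    if isinstance(s, str):
        return False                      # a variable dominates nothing
    if isinstance(t, str):
        return occurs(t, s)               # s is not a variable, so s != t
    f, g = s[0], t[0]
    # Case 1: some argument of s is equal to or greater than t.
    if any(si == t or lpo_gt(si, t, prec) for si in s[1:]):
        return True
    # Case 2: f is above g and s is greater than every argument of t.
    if (f, g) in prec:
        return all(lpo_gt(s, tj, prec) for tj in t[1:])
    # Case 3: same symbol; compare the argument lists lexicographically.
    if f == g and len(s) == len(t):
        if not all(lpo_gt(s, tj, prec) for tj in t[1:]):
            return False
        for si, ti in zip(s[1:], t[1:]):
            if si != ti:
                return lpo_gt(si, ti, prec)
    return False


lhs, rhs = ("f", ("h", "x")), ("g", "x")  # the rule f(h(x)) -> g(x)
print(lpo_gt(lhs, rhs, {("h", "g")}))     # True
print(lpo_gt(lhs, rhs, {("f", "g")}))     # True
print(lpo_gt(lhs, rhs, set()))            # False
```

For this rule the check succeeds exactly when h or f is above g in the precedence.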

2.3 Encoding Termination in BDD

We encode the sufficient condition of termination based on LPO as a Boolean
function. We present three methods of encoding. The first method is based on the
natural encoding of LPO plus the irreflexivity and transitivity of partial
orderings. The second encoding improves this representation by introducing the
notion of explicit transitivity and irreflexivity, providing an encoding with
fewer Boolean variables. Finally, the last encoding safely removes the transitivity
part at the small cost of losing minimality of the explicit irreflexivity. In the
following, we briefly review these encodings. A more detailed account is given
in [8].

Basic encoding Let $\mathcal{F}(\mathcal{R})$ be the set of function symbols
occurring in the program $\mathcal{R}$, and
$X = \{x_{fg} \mid f, g \in \mathcal{F}(\mathcal{R})\}$ be the set of Boolean
variables. By an assignment of a Boolean value to $x_{fg}$, we represent the truth
of the proposition $f \succ g$. Thus a precedence $\succ$ is represented by an
assignment of truth values to all the variables in $X$. Since the precedence is a
strict partial ordering, it must satisfy the transitivity ($T(X)$) and
irreflexivity ($I(X)$) represented by the following Boolean functions,
respectively.

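$$T(X) = \prod_{f,g,h \in \mathcal{F}(\mathcal{R})} \left[\overline{x_{fg}} + \overline{x_{gh}} + x_{fh}\right], \qquad I(X) = \prod_{f \in \mathcal{F}(\mathcal{R})} \overline{x_{ff}}. \tag{1}$$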

The transitivity specifies that if $x_{fg} = 1$ and $x_{gh} = 1$ then
$x_{fh} = 1$ (for all $f$, $g$, and $h$); and the irreflexivity simply means that
$x_{ff} = 0$ for all $f$. In [8], the condition $l \succ_{lpo} r$ is encoded in a
recursive way as a Boolean function denoted by $LPO_{l,r}(X)$; the conjunction of
such conditions for all rules in a program
$\mathcal{R} = \{l_i \to r_i \mid i = 1, \ldots, m\}$ is represented by the
following function:

$$LPO_{\mathcal{R}}(X) = \prod_{i=1}^{m} LPO_{l_i, r_i}(X). \tag{2}$$

For example, if $\mathcal{R}$ consists of a single rule $f(h(x)) \to g(x)$, then
$LPO_{\mathcal{R}}(X) = x_{hg} + x_{fg}$. This means that if $h \succ g$ or
$f \succ g$, then $f(h(x)) \succ_{lpo} g(x)$.

Now we can combine the three functions to yield the following encoding.

$$H_{\mathcal{R}}(X) = LPO_{\mathcal{R}}(X) \cdot T(X) \cdot I(X). \tag{3}$$

It is proved that if $H_{\mathcal{R}}(X)$ is satisfiable, then $\mathcal{R}$ is
terminating [6]. Note that the satisfiability of a Boolean function can be
determined simply by checking if its BDD representation is not $BDD(0)$, because
every BDD distinct from $BDD(0)$ contains a path from its root to the terminal
vertex with truth value 1.
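
The basic encoding can be checked by brute force on the single-rule example. The Python sketch below enumerates all assignments to the variables $x_{fg}$ for the symbols f, g and h and keeps those satisfying the LPO condition, transitivity and irreflexivity. Exhaustive enumeration is a stand-in chosen here for illustration and is not the BDD-based method of the paper; the helper names are hypothetical.

```python
from itertools import product


def transitivity(x, symbols):
    """T(X): x[f,g] and x[g,h] together imply x[f,h]."""
    return all((not x[f, g]) or (not x[g, h]) or x[f, h]
               for f in symbols for g in symbols for h in symbols)


def irreflexivity(x, symbols):
    """I(X): x[f,f] is false for every symbol f."""
    return all(not x[f, f] for f in symbols)


def satisfying_assignments(lpo, symbols):
    """Yield every assignment satisfying lpo(x), T(X) and I(X)."""
    pairs = [(f, g) for f in symbols for g in symbols]
    for bits in product([False, True], repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        if lpo(x) and transitivity(x, symbols) and irreflexivity(x, symbols):
            yield x


symbols = ["f", "g", "h"]


def lpo(x):
    # LPO condition for the rule f(h(x)) -> g(x): x_hg + x_fg.
    return x["h", "g"] or x["f", "g"]


models = list(satisfying_assignments(lpo, symbols))
print(len(models) > 0)  # True: the encoding is satisfiable
```

Each satisfying assignment corresponds to a precedence that proves termination of the one-rule program.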

Encoding with explicit variables If the program $\mathcal{R}$ contains $n$
different function symbols, then the set $X$ consists of $n^2$ Boolean variables.
The transitivity condition $T(X)$ consists of $n^3$ clauses and tends to be the
biggest part of $H_{\mathcal{R}}(X)$. In most cases, however, the set
$X(\mathcal{R})$ of Boolean variables occurring in $LPO_{\mathcal{R}}(X)$ tends to
be only a small part of $X$. We call them explicit Boolean variables. This
motivates the definitions of the explicit transitivity condition
$T'(X(\mathcal{R}))$ and the explicit irreflexivity condition $I'(X(\mathcal{R}))$
defined in [8]. These conditions are based on graph-theoretic enumeration of some
sort of minimal paths and cycles of the graph $G = (V, E)$ with
$V = \mathcal{F}(\mathcal{R})$ and $E = X(\mathcal{R})$. We can use them in place
of basic transitivity and irreflexivity for encoding the termination as follows:

$$H'_{\mathcal{R}}(X(\mathcal{R})) = LPO_{\mathcal{R}}(X(\mathcal{R})) \cdot T'(X(\mathcal{R})) \cdot I'(X(\mathcal{R})) \tag{4}$$

where we have written $LPO_{\mathcal{R}}(X(\mathcal{R}))$ for
$LPO_{\mathcal{R}}(X)$ to emphasize that all the Boolean variables occurring in
this function are members of $X(\mathcal{R})$. It is proved that
$H_{\mathcal{R}}(X)$ is satisfiable if and only if
$H'_{\mathcal{R}}(X(\mathcal{R}))$ is satisfiable [6].

Encoding without transitivity Since the computation of the transitivity part
often requires a fair amount of time even for explicit conditions, it can cause
an efficiency problem. This motivates the encoding without transitivity defined
in [8]. Let $I''(X(\mathcal{R}))$ be the explicit irreflexivity condition for
$X(\mathcal{R})$ obtained by considering all minimal cycles plus all non-minimal
simple cycles. Then we get the following encoding of termination:

$$H''_{\mathcal{R}}(X(\mathcal{R})) = LPO_{\mathcal{R}}(X(\mathcal{R})) \cdot I''(X(\mathcal{R})). \tag{5}$$

Note that this encoding has no transitivity part. Nevertheless, it is known
that $H_{\mathcal{R}}(X)$ is satisfiable if and only if
$H''_{\mathcal{R}}(X(\mathcal{R}))$ is satisfiable [6].

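Example Figure 2 illustrates the BDDs for the two encodings
$H'_{\mathcal{R}}(X(\mathcal{R}))$ and $H''_{\mathcal{R}}(X(\mathcal{R}))$,
respectively, where

$$LPO_{\mathcal{R}}(X(\mathcal{R})) = (x_{hg} + x_{fg}) \cdot (x_{hf} + x_{gf})$$

$$T'(X(\mathcal{R})) = (\overline{x_{hf}} + \overline{x_{fg}} + x_{hg}) \cdot (\overline{x_{hg}} + \overline{x_{gf}} + x_{hf})$$

$$I'(X(\mathcal{R})) = I''(X(\mathcal{R})) = \overline{x_{fg} \cdot x_{gf}}$$

and $\mathcal{R}$ consists of two rules $f(h(x)) \to g(x)$ and
$g(h(x)) \to f(x)$. Note that the size of the BDD for
$H''_{\mathcal{R}}(X(\mathcal{R}))$ can be larger than that for
$H'_{\mathcal{R}}(X(\mathcal{R}))$, as is the case in this example. However,
without the transitivity part, $H''_{\mathcal{R}}(X(\mathcal{R}))$ can often be
computed much faster, as will be shown later.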

Fig. 2. Example BDDs for two encodings

3 Heuristics

3.1 Basic Procedure and Policy for Heuristics

In this section we present the basic procedure for proving the satisfiability of
$H'_{\mathcal{R}}(X(\mathcal{R}))$, using its BDD representation. Similar
procedures for $H_{\mathcal{R}}(X)$ and $H''_{\mathcal{R}}(X(\mathcal{R}))$ should
be clear from this construction. We suppose that the procedure is given a program
$\mathcal{R}$, rule by rule, as an input and returns a truth value according to
the satisfiability of $H'_{\mathcal{R}}(X(\mathcal{R}))$. After initializing
$R = X(R) = \emptyset$ and $H(R) = BDD(1)$, the procedure executes the following
steps repeatedly:



1. If all rules have been input, then return true:
$H'_{\mathcal{R}}(X(\mathcal{R}))$ is satisfiable.

2. Input a new rule $l \to r$, and add it to $R$, resulting in
$R' = R \cup \{l \to r\}$. Let $\Delta X$ be the set of Boolean variables
occurring in $LPO_{l,r}(X)$ but not in $X(R)$. Introduce a variable ordering to
make $X(R') = X(R) \cup \Delta X$ a linearly ordered set.

3. Let $\Delta LPO = BDD(LPO_{l,r}(X))$. Let $\Delta T'$ and $\Delta I'$ be the
BDDs of the Boolean product of the minimal explicit transitivity and
irreflexivity conditions, respectively, containing at least one Boolean variable
from $\Delta X$.

4. Let $H(R') = BDD(H(R) \cdot \Delta LPO \cdot \Delta T' \cdot \Delta I')$. If
$H(R') = BDD(0)$, then return false: $H'_{R'}(X(R'))$ is unsatisfiable.

5. Set $R$, $X(R)$ and $H(R)$ to $R'$, $X(R')$ and $H(R')$, respectively.
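
The control flow of the steps above can be sketched in Python. To keep the example self-contained, the accumulated function H(R) is represented by an explicit list of satisfying assignments instead of a BDD, which is a deliberate simplification and not the authors' implementation. Each rule is assumed to arrive as a pair `(new_vars, constraint)`, where `constraint` plays the role of the product of the LPO, transitivity and irreflexivity increments; how those increments are derived from a rule is outside this sketch, and all names are hypothetical.

```python
from itertools import product


def incremental_check(increments):
    """Return True if the conjunction of all increments is satisfiable.

    increments: iterable of (new_vars, constraint) pairs, one per rule,
    where constraint(model) -> bool is evaluated on a dict of variables.
    """
    models = [{}]          # the constant-true function over no variables
    for new_vars, constraint in increments:
        extended = []
        for model in models:
            # Extend each surviving model over the newly created variables.
            for bits in product([False, True], repeat=len(new_vars)):
                candidate = {**model, **dict(zip(new_vars, bits))}
                if constraint(candidate):
                    extended.append(candidate)
        if not extended:   # corresponds to H(R') = BDD(0)
            return False
        models = extended
    return True


# Rules f(h(x)) -> g(x) and g(h(x)) -> f(x), with the increments taken
# from the worked example of the explicit encoding.
rule1 = (["hg", "fg"], lambda x: x["hg"] or x["fg"])
rule2 = (["hf", "gf"], lambda x: (x["hf"] or x["gf"])
         and (not x["hf"] or not x["fg"] or x["hg"])
         and (not x["hg"] or not x["gf"] or x["hf"])
         and not (x["fg"] and x["gf"]))
print(incremental_check([rule1, rule2]))  # True

# Rules f(x) -> g(x) and g(x) -> f(x) demand both f > g and g > f.
bad1 = (["fg"], lambda x: x["fg"])
bad2 = (["gf"], lambda x: x["gf"] and not (x["fg"] and x["gf"]))
print(incremental_check([bad1, bad2]))    # False
```

The early return mirrors step 4: as soon as the accumulated function becomes unsatisfiable, no further rules need to be processed.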

Note that the procedure is nondeterministic in the following three points. 

(1) The order in which a new rule is input in step 2 is arbitrary. 

(2) The linear ordering to be determined in step 2 is arbitrary. 

(3) The order in which the product of four BDDs is taken in step 4 is arbitrary. 

How these arbitrary choices are implemented can affect the efficiency of the
procedure. In general BDD terminology, this problem can be described as the
problem of determining the variable ordering for (2) and the operation ordering
for (1) and (3). We need good heuristic choices to implement the procedure
effectively. However, we cannot devise and examine an unlimited number of
heuristics. In the following, we discuss the basic policy for the framework of the
heuristics considered in this paper. The policy consists of the following three
items.

1. We fix the order of rule input to the order of the rules specified by the users.
This is because we assume the system verifies the termination incrementally
(each time a rule is input) and on interactive systems we cannot control the
order of input by the users.

2. We only consider classes of variable orderings that can be extended
incrementally. This means that when a new Boolean variable has been created in
the procedure, we add it to the current set of Boolean variables and extend
the associated ordering consistently (without violating the current ordering).
Further restrictions will be discussed later.

3. There are 18 possibilities for computing the product of four BDDs: 12
cases for the linear product of the form $((a \cdot b) \cdot c) \cdot d$, and 6
cases for the form $(a \cdot b) \cdot (c \cdot d)$, taking into account that
multiplication is commutative. In our experiments, we consider all of them.
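
The count of 18 product orders can be reproduced with a short enumeration. The Python sketch below uses one reading that matches the stated numbers: the first pair to be multiplied is unordered (6 choices), the linear form then fixes the order of the remaining two factors (2 choices each), and the balanced form multiplies the chosen pair by the product of the other two (1 choice each). The function name and the nested-tuple representation are assumptions for illustration.

```python
from itertools import combinations, permutations


def product_orders(factors):
    """Enumerate the evaluation orders for a product of four distinct factors."""
    factors = list(factors)
    orders = []
    for first in combinations(factors, 2):
        rest = [f for f in factors if f not in first]
        for c, d in permutations(rest):
            orders.append(((first, c), d))       # ((a.b).c).d
        orders.append((first, tuple(rest)))      # (a.b).(c.d)
    return orders


orders = product_orders(["H", "dLPO", "dT", "dI"])
linear = [o for o in orders if isinstance(o[1], str)]
balanced = [o for o in orders if isinstance(o[1], tuple)]
print(len(linear), len(balanced), len(orders))   # 12 6 18
```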

3.2 Variable Ordering 

As we have already discussed, we only consider classes of orderings that can 
be extended incrementally. We further restrict ourselves to orderings depending 
only on the orderings in which Boolean variables and function symbols have been 
created or encountered by the procedure. This is because such orderings strongly 
depend on the orders of the user’s input, which heuristically reflect the user’s 
mental model associating symbols and rules with their meanings and structure 
in the real world. In other words, the orders of the user’s input in interactive 
systems are not random but likely to be consistent with some implicit structure 
of meanings. In the following we introduce some classes of variable orderings 
satisfying these restrictions. 

- Random ordering (RAND): This ordering just orders the variables randomly,
according to the uniform distribution. We do not expect this ordering to be
effective but use it as the baseline for evaluating other orderings.

- Generation ordering (GO): This ordering orders variables as generated.
More precisely, it is the ordering in which we encounter the Boolean variables
while expanding the formulas $LPO_{l,r}(X)$ for each input rule $l \to r$ in a
depth-first, left-to-right manner according to the definition given in [8]. This
scheme is quite natural in terms of implementation.

- Left ordering (LO): We define two total orderings $<_L$ and $<_R$ on the set
of function symbols:

$f <_L g \iff f$ occurs before $g$ in the left-hand sides of rules,

$f <_R g \iff f$ occurs before $g$ in the right-hand sides of rules.

The left ordering is defined by $x_{fg} < x_{ij}$ if and only if $f <_L i$ or
($f = i$ and $g <_R j$).

- Right ordering (RO): This ordering is the same as LO except that the
ordering $<_R$ dominates over $<_L$.
We also consider the orderings obtained by reversing each one of the above 
orderings except RAND and call them the reversed generation ordering (RGO), 
the reversed left ordering (RLO), and the reversed right ordering (RRO), respec- 
tively. 
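
As a small illustration of the left ordering and its reversal, the Python sketch below sorts Boolean variables, given as pairs `(f, g)` standing for $x_{fg}$, by a lexicographic key. It assumes that the ranks of the function symbols under $<_L$ and $<_R$ are already available as dictionaries; how symbols missing from one side of the rules are ranked is not specified in the text, so the rank tables here are hypothetical inputs.

```python
def left_ordering(variables, rank_left, rank_right):
    """LO: compare first symbols by <_L, then second symbols by <_R."""
    return sorted(variables,
                  key=lambda fg: (rank_left[fg[0]], rank_right[fg[1]]))


def reversed_left_ordering(variables, rank_left, rank_right):
    """RLO: the left ordering read in the opposite direction."""
    return list(reversed(left_ordering(variables, rank_left, rank_right)))


# Hypothetical ranks: smaller rank means the symbol occurs earlier.
rank_left = {"f": 0, "h": 1, "g": 2}
rank_right = {"g": 0, "f": 1, "h": 2}
variables = [("h", "f"), ("g", "f"), ("h", "g"), ("f", "g")]

print(left_ordering(variables, rank_left, rank_right))
# [('f', 'g'), ('h', 'g'), ('h', 'f'), ('g', 'f')]
```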

3.3 Operation Ordering

The problem of operation orderings in our basic procedure includes the choice
of the order in which the Boolean multiplication of four BDDs is taken. From a
more global viewpoint, however, it should be stressed that a kind of operation
ordering is already built into the framework of the procedure itself, because it
integrates the entire BDD as a product of interleaved increments of three kinds of
constraints. More precisely, let $(\Delta LPO)_j$, $(\Delta T')_j$, and
$(\Delta I')_j$ be the incremental constraints obtained from the $j$-th rule input
with $1 \le j \le m$, and let $(\Delta H')_j$ be their product. Clearly, the basic
procedure computes the product

$$H'_{\mathcal{R}}(X(\mathcal{R})) = (\Delta H')_1 \cdot (\Delta H')_2 \cdots (\Delta H')_m$$

from left to right. This order of integration of constraints is definitely different
from the one that computes the product simply according to the original
definition (4). We refer to the former and the latter as the incremental and the
batch mode operation orderings, respectively, and will present a brief report on
the experimental results in the next section.



4 Experiments and Comparison 

We have performed comprehensive experiments on typical sample programs 
taken from several application domains, including hardware diagnosis, software 
specification, and mathematics. In this section, we present some results to show 
how the efficiency of our basic procedure is affected by the choice of variable 
orderings and operation orderings. In particular, we will see that in most cases 
the use of RGO for the variable ordering leads to good results. Then we will 
briefly report other experimental results which show that the proposed
procedures implementing $H'_{\mathcal{R}}(X(\mathcal{R}))$ and
$H''_{\mathcal{R}}(X(\mathcal{R}))$ are significantly more efficient
than the implementation of the naive encoding $H_{\mathcal{R}}(X)$.

4.1 Effects by Variable Orderings

We have selected two particular problems for presenting the results in detail. 
One is taken from the field of the model-based hardware diagnosis. We refer 
to this problem as CIRCUIT [7]. The rules have been introduced by directing, 
from left to right, the equations that specify the behavior of the full adder. The 
number of explicit Boolean variables for this problem is 36. 

The other problem referred to as solitaire consists of 28 rules taken from 
the field of the algebraic specification of software systems [13]. The number of 
explicit Boolean variables for this problem is 46. 

The two problems are input to our basic procedure written in Lisp and run
on a PC. The results are summarized in Tables 1 and 2. Each entry should
be interpreted as follows. The var order shows the names of heuristic variable
orderings. The BDD size is the number of nonterminal vertices of the resultant
BDD, uniquely determined by the variable ordering employed. The tree size
is the number of vertices of the tree obtained by transforming the BDD into
the equivalent tree form. Comparison of the BDD size with this tree size
shows how many duplicate vertices are shared in the BDD. The max size
indicates the size of the biggest BDD temporarily generated in the course of the
procedure. Even if the variable orders are the same, this value depends on the
operation ordering employed. The values shown here are the results of employing
the ordering that will be recommended in the next section. The generation shows
the total number of vertices created in the procedure. The comparison counts
the number of comparisons (with respect to the variable ordering) of two Boolean
variables in the generation of BDDs. The CPU time is the time (in seconds)
required for getting a solution. The entries for the random ordering (RAND) are
the averages of the results of ten trials.

Table 1. Experimental results for CIRCUIT

var order       RAND     GO     LO     RO    RGO    RLO    RRO
BDD size          39     33     32     33     33     32     33
tree size         41     41     39     40     54     55     55
max size        5804    751   1232   2625    594    965   2527
generation     19725   8776  13175   9965   2059   3307   4253
comparison     30713  11871  17387  12801   3668   6265   7679
CPU time (s)    9533    797   1921   1067     70    225    357

Table 2. Experimental results for solitaire

var order       RAND     GO     LO     RO    RGO    RLO    RRO
BDD size        1459    398    451    494    400    515    482
tree size       8231  17574  18495  10951  12894  15365  17527
max size        2062   1601   1407    604    675    898    577
generation     11551   6192   6764   3994   1627   2243   3659
comparison     17787   8546   9828   6140   3945   5424   6240
CPU time (s)    2479    452    666    220     74    169    242

The results show that all six heuristic variable orderings are more efficient
than the random ordering (RAND) in both time and space. For CIRCUIT (Table 1),
we notice that the difference between the BDD size and the tree size for RAND is
only two, meaning that only two vertices are shared in the BDD. In contrast, for
the other orderings, at least seven vertices (17.5 %) are saved by adopting the
BDD representation. In particular, the cases of RGO, RLO, and RRO are remarkable,
because the tree sizes are greater but the BDD sizes are smaller than in the case
of RAND, thanks to the saving of more than twenty vertices (38.8 %).

The most efficient variable ordering depends on the problem and the
definition of “efficiency.” However, we recommend the use of RGO, because in
most cases it shows the best (or at least relatively good) performance in time and
space. This is also true for other problems mentioned in the following subsections.

It has been found that the BDD structure provides compact representations
for many practically important Boolean functions in VLSI design, etc. In
general, however, not all Boolean functions are suitable for the BDD representation.
Now let us see that our Boolean function $H'_{\mathcal{R}}(X(\mathcal{R}))$ is
suitable for BDD representation. Surprisingly, for most cases (including our two
sample problems) this is true even if we use RAND. Let us check this for our
sample problems. In most implementations, a BDD vertex is stored in three words of
memory: one word for each of the two pointers for the outgoing edges and one for
the identification of the Boolean variable. Supposing each word is 32 bits, we
can store a single vertex in 96 bits. Then we can see that the 39 and 1459
vertices of the BDDs created in the sample problems with RAND can be stored in
about $2^{12}$ and $2^{17}$ bits, respectively. On the other hand, the truth tables
for representing the corresponding two Boolean functions with 36 and 46 Boolean
variables would require $2^{36}$ and $2^{46}$ bits, respectively.
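
The storage comparison is simple arithmetic and can be checked with a few lines of Python. The sketch below uses the figures quoted in the text (96 bits per vertex, 39 and 1459 vertices, 36 and 46 variables); the function names are illustrative.

```python
from math import log2

BITS_PER_VERTEX = 3 * 32   # three 32-bit words per BDD vertex


def bdd_bits(vertices):
    """Storage for a BDD with the given number of nonterminal vertices."""
    return vertices * BITS_PER_VERTEX


def truth_table_bits(num_vars):
    """A truth table needs one bit per assignment of the variables."""
    return 2 ** num_vars


for name, vertices, num_vars in [("CIRCUIT", 39, 36), ("solitaire", 1459, 46)]:
    bits = bdd_bits(vertices)
    print(name, bits, round(log2(bits), 2), truth_table_bits(num_vars))
```

The BDDs need 3744 and 140064 bits, which are close to $2^{12}$ and $2^{17}$, against $2^{36}$ and $2^{46}$ bits for the truth tables.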



4.2 Effects by Operation Orderings 

We have seen that our basic procedure implicitly incorporates a kind of operation 
ordering called the incremental mode, which is different from the mathematically 
natural way of the batch mode. By several experiments, we have clearly found 
the incremental mode superior to the batch mode. Actually, the BDD size and 
the CPU time for the batch mode were more than 10 and 18 times greater, 
respectively, than the incremental mode in all the experiments. 

The order of taking the multiplication of the four BDDs did not affect the
performance so dramatically, but the types of ((H(R) · a) · b) · c, imposing the
constraints sequentially on top of the current BDD, performed better than the
others. In particular, we can recommend the use of ((H(R) · ALPO) · AI') · AT',
because in most cases it behaved the best.

Table 3 shows the results for five other problems taken from the domain of 
mathematics. Those are all the problems given in [14] that contain at least five 
function symbols and are terminating. The variable ordering is fixed to RGO, 
and the operation ordering is fixed to our recommendation above. The table 
includes two new entries for the number of rules and the number of explicit 
Boolean variables. We can say that all the problems have been solved efficiently. 



4.3 Effects by Explicit Conditions 

We briefly describe how the explicit transitivity and irreflexivity conditions on
$X(\mathcal{R})$ result in more efficiency than the ordinary transitivity and irreflexivity
conditions on $X$. Indeed, the use of $H^{ti}(X(\mathcal{R}))$ is far better than $H^{ti}(X)$. This is
justified by our experiments in which the use of $H^{ti}(X)$ did not yield a solution of
the CIRCUIT problem within 48 hours. Actually, we found that this run required a
temporary BDD whose maximum size is 35 times greater than the BDD required
in the run for $H^{ti}(X(\mathcal{R}))$.

Table 3. Experimental results for mathematical systems

No.              12     26     27     29     31
rules             9     10     11      7      7
explicit vars    14     18     17     18     11
BDD size         16     15     37     34     15
tree size        20     15     56    102     31
max size         85     81    116     43     16
generation      311    291    356    175     60
comparison      343    327    391    174     53
CPU time       1.16   0.94   1.33   0.35   0.10



Table 4. Experimental results for transitivity removal

Problem          CIRCUIT          SOLITAIRE
var order        GO     RGO       GO     RGO
BDD size         33      35       93      93
tree size        48     100     4500    4043
max BDD size    126     157      107      99
generation     1063     358      918     286
comparison     1199     846     1002     390
CPU time       8.95    6.95     7.56    2.62



4.4 Effects by Transitivity Removal 

Table 4 shows the results of the experiments in which the previous two sample 
problems were solved by using H^{X{TZ)) in place of H^{X{TZ)). We only show 
the entries for the GO and RGO orders. Comparing this table with Tables 1 
and 2, we can see that the use of H^{X{TZ)) is more efficient than H^{X{TZ)). 
Theoretically speaking, we could try to construct problems in which H!^{X{'R,)) 
is more efficient, but we have not encountered such problems in practice, yet. 

5 Conclusion 

In conclusion, let us compare our method with two other works. One is the
simplest approach, based on backtracking [5]. It is well known that simple
backtracking suffers from inefficiency caused by futile backtracking, rediscovering
contradictions, and rediscovering inferences. Actually, the CPU time for computing all
solutions for the CIRCUIT and SOLITAIRE problems by the backtracking method
was 108 and 2700 seconds, respectively, compared with 70 and 74 seconds by
our method. The other work [7] uses a reason maintenance system to avoid the
drawbacks caused by simple backtracking. It is reported that this method
was successful in getting a single solution efficiently. In contrast, our method is
effective only if all solutions are sought (or if you want to check that there are
no solutions).

Abstract. In most accounts of common-sense reasoning, only the most
preferred among models supplied by the evidence are retained (and the
rest eliminated) in order to enhance the inferential prowess. One problem
with this strategy is that the agent’s working set of models shrinks
quickly in the process. We argue that instead of rejecting all the
non-best models, the reasoner should reject only the worst models and then
examine the consequences of adopting this principle in the context of
abductive reasoning. Apart from providing the relevant representation
results, we indicate why an iterated account of abduction is feasible in
this framework.

Keywords: belief revision, common-sense reasoning, philosophical foun- 
dations 



1 Introduction 

In many approaches to common-sense reasoning [6], belief change [3] and
abductive reasoning [10], appeal is made to the principle of minimal change. This
principle can be viewed as the commonsensical principle of selecting the best
from the available set of alternatives [12]. In a recent work [9], Nayak et al. have
advocated the adoption of the principle of rejecting the worst in lieu of the
principle of selecting the best in the context of AGM style belief revision [1,3]. The
aim of this work is to extend this idea to abductive reasoning - in particular, to
explore the consequences of discarding the “choose the best” principle in favour of
the “reject the worst” principle in the context of abductive belief change [10].

This paper is organised as follows. In the next section we argue that the 
principle of selecting the best is inappropriate in contexts of a certain character, 
and should be discarded in favour of the principle of rejecting the worst. In 
section §3 we quickly present the account of abductive belief change due to 
Pagnucco [10] and argue that it is one of those contexts where adopting the 
principle of selecting the best has perilous consequences. Section §4 explores the 
consequences of adopting the principle of rejecting the worst in the context of 
abductive belief change. Section §5 is devoted to soundness and completeness 
results for this approach. We end with a brief discussion regarding the feasibility 
of an iterative account of abductive belief change in the proposed framework. 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 365-377, 1999. 

© Springer-Verlag Berlin Heidelberg 1999

2 The Perils of Choosing the Best

The principle of choosing the best essentially says that if there are multiple 
available ways of attaining a certain (desired) goal, one should choose those that 
one considers best (according to some contextually defined preference criteria). 
This principle is very appealing indeed. A moment’s reflection shows that the 
appeal of this principle lies in a simple linguistic fact, namely, that the expression 
“best item” more or less means an item that should be chosen if offered as an 
alternative. The principle “select the best” is hence a glaringly obvious but 
entirely content-less principle. This principle means that one should select what 
should be selected, and hence is only as good as the underlying implementation 
of the concept “best”. More to the point, there is the underlying assumption
that one already knows what is best in the given context. In particular, if one 
does not exactly know what the best item in the choice set is, and one considers x 
to be only a first approximation to what might be the best, the principle “select 
the best” has no prescriptive force as to whether or not one should select x. 

Let us now consider a concrete situation. Suppose you are planning to fly 
from Australia to Europe and you are considering which airline to choose. Your 
choice set, of course, is the set of airlines that provide service from Australia to 
Europe. The simple suggestion, “Choose the best airline”, is not of much help
since it does not tell you which airline to choose. A bit of soul searching might
explicate your criteria of choice - (low) price, (good) service, (fewer)
stopovers, (fewer) hours of waiting at airports, (convenient) departure and arrival
times, (good) safety record.¹

Now, if you knew how to quantify these properties of an airline, and the
relative importance of these properties (as weights) so far as your choice is
concerned, then you could presumably take the weighted sum of the first figures
as the desirability of an airline and easily determine what the best airline is. In
other words, given that we have an exhaustive list of the preference criteria, their
relative importance as weights and the relevant properties of individual airlines
as quantities, the criteria in question can be combined in order to provide a
single (read ultimate) preference criterion. Now, applying the principle of selecting
the best we can find the desired airline to be contacted. (If there is more than
one of them, we can devise some tie-breaking mechanism.)

In practice, however, the required quantifications may not be available. If so,
we are not dealing with a single (read ultimate) preference criterion, but a bunch
of them, and we have to use these criteria sequentially to determine the item to
be selected.

The question then is whether the principle of selecting the best can be applied
in this situation. In our favourite example, suppose that the criteria in question
were arrived at (and applied) in the given sequence. Table 1 encodes information
as to the preference among airlines with respect to different criteria.²

¹ These are not necessarily the only criteria you are going to consider - you might
come up with more criteria, e.g. the type of frequent flyer program offered by the
airline, and want to add them to the list later without recomputing the best airline
from scratch again. But without loss of generality, let us pretend that the list in
question is complete.

Abduction without Minimality 367

Price      Service    Stopovers  Waiting    Timing     Safety
AF, QA     JAL        BA         SW, KLM    LU, SA     LU
BA         SW, SA     SW, LU     LU         SW         SW, KLM
LU         QA, KLM    KLM, JAL   SA, JAL    JAL        JAL, BA
KLM, SA    BA, LU     AF, QA     QA, AF     AF, KLM    SA
JAL, SW    AF         SA         BA         QA, BA     QA, AF

Table 1. Preference over airlines based on different criteria.

For instance, according to this table, Air France and Qantas offer the best price
followed by British Airways which is in turn followed by Lufthansa. On the other
hand, JAL and Swiss Air offer the worst price, whereas KLM and Singapore
Airlines offer next to worst. Suppose you consider low price as the primary factor.
By applying the principle of selecting the best, the choice set is shrunk to just
Air France and Qantas. Next you come up with the criterion “good service”.
Since your choice set is now {Air France, Qantas} and Qantas fares better
than Air France on service count, your choice set is now reduced to the singleton
set {Qantas}. After that you are stuck with Qantas, no matter how terrible
its safety record is, no matter how inconvenient its timing is, etc. - unless you
are prepared to go back to the original choice set and apply the criteria in a
different sequence.³ In fact, according to our table, Qantas has the worst safety
record, the worst timing and is only next to worst in both waiting and number of
stopovers. You are still committed to choosing this airline due to the principle of
choosing the best!

If instead of price, you had started with the quality of service, you would have
been stuck with JAL right at the outset although it is most expensive among
the airlines and is only mediocre as far as stopovers, waiting time and timing
are concerned and next to worst in safety record.

² This table is completely fictitious, and has nothing to do with what really is or is
not the case.

³ But that does not solve the problem, only postpones it!

The perils of the “select the best” approach are obvious. By selecting the
best, we are severely restricting the available choices for future selection, and
end up choosing an option which is possibly not at all preferable on some other
count. The way out of this peril is equally obvious - we should follow some
approach which is less restrictive. One way to achieve this goal is, instead of
rejecting every option which is non-best, to reject only every option
that is worst. Applying this alternative principle to our pet example, when we
consider the criterion of price, we reject JAL and Swiss Air, and are still left with
six other airlines. Next, on service count we eliminate Air France, on stopover
count we reject Singapore Airlines, on waiting count British Airways, on count
of timing Qantas (BA is already eliminated at this point), and on safety count
KLM (at this point SW, JAL, BA, SA, QA and AF are no longer available for
elimination). Thus we end up selecting Lufthansa, which is ranked mediocre on
count of price, second best on counts of stopovers and waiting, best on counts of
timing and safety, and next to worst only on count of service. Many would agree
that this is a much more sensible choice than Qantas, given Table 1.
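
The two walk-throughs of the airline example can be replayed mechanically. The following is a minimal sketch, assuming the rows of Table 1 run from best (top) to worst (bottom) and that the criteria are applied in the column order of the table; the names `select_best`, `reject_worst` and `run` are illustrative only.

```python
# Preference groups transcribed from Table 1, best group first.
TABLE = {
    "price":     [["AF", "QA"], ["BA"], ["LU"], ["KLM", "SA"], ["JAL", "SW"]],
    "service":   [["JAL"], ["SW", "SA"], ["QA", "KLM"], ["BA", "LU"], ["AF"]],
    "stopovers": [["BA"], ["SW", "LU"], ["KLM", "JAL"], ["AF", "QA"], ["SA"]],
    "waiting":   [["SW", "KLM"], ["LU"], ["SA", "JAL"], ["QA", "AF"], ["BA"]],
    "timing":    [["LU", "SA"], ["SW"], ["JAL"], ["AF", "KLM"], ["QA", "BA"]],
    "safety":    [["LU"], ["SW", "KLM"], ["JAL", "BA"], ["SA"], ["QA", "AF"]],
}

# RANK[criterion][airline] = 0 for the best group, 4 for the worst group.
RANK = {
    criterion: {a: i for i, group in enumerate(groups) for a in group}
    for criterion, groups in TABLE.items()
}
AIRLINES = set(RANK["price"])


def select_best(choices, criterion):
    """Keep only the options ranked best among the current choices."""
    rank = RANK[criterion]
    best = min(rank[a] for a in choices)
    return {a for a in choices if rank[a] == best}


def reject_worst(choices, criterion):
    """Drop only the options ranked worst among the current choices.

    If all current choices are ranked equally, nothing is dropped.
    """
    rank = RANK[criterion]
    best = min(rank[a] for a in choices)
    worst = max(rank[a] for a in choices)
    if best == worst:
        return set(choices)
    return {a for a in choices if rank[a] != worst}


def run(rule):
    """Apply one rule criterion by criterion, in table column order."""
    choices = set(AIRLINES)
    for criterion in TABLE:
        choices = rule(choices, criterion)
    return choices


print(run(select_best))   # choice set left by "select the best"
print(run(reject_worst))  # choice set left by "reject the worst"
```

Under these assumptions the first rule is locked into its choice after two criteria, while the second keeps narrowing the set one step at a time, which is the contrast the example is meant to bring out.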

We have thus noticed that there are contexts when the “reject the worst” 
principle seems to be more sensible than the “select the best” principle. We will 
conclude this section with a sketchy outline of the features which, when present, 
make a context more appropriate for the “reject the worst” principle as opposed 
to the “select the best” principle. 

- First of all, these principles apply to a choice context. If no choice is at issue, 
then these principles are irrelevant in that context. 

- Given that a choice is to be made, the set of alternatives (or the choice set)
is clearly specified, and the choice must be made from members of that
set. For instance, in our example, since Cathay Pacific is not an available
option, the agent is not allowed to choose Cathay Pacific.

- It is understood that the choice being made is not necessarily the final choice. 
It is possible that the agent might be required to narrow down the choices 
further in light of hitherto unavailable criteria. 

We will maintain that the above three are the salient features of a choice context 
in which the “choose the best” principle should be given up in favour of the 
“reject the worst” principle. In the next section we will show how abductive 
reasoning is a context with these features, and hence is appropriate for the 
“reject the worst” principle. 

3 Abduction by Choosing the Best

Recently a very interesting account of abduction has been offered by 
Pagnucco [10] as an extension of the classic AGM system of belief change [1].
We will briefly recount the AGM system of belief change followed by Pagnucco’s 
account of abductive belief change. 



3.1 Belief Change

In the AGM system, a belief state is represented as a theory (i.e., a set of 
sentences closed under your favourite consequence operation), new information 
(epistemic input) is represented as a single sentence, and a state transition func- 
tion, called revision, returns a new belief state given an old belief state and an 
epistemic input. If the input in question is not belief contravening, i.e., does not 
conflict with the given belief state (theory), then the new belief state is simply 
the consequence closure of the old state together with the epistemic input. In 
the other case, i.e., when the input is belief contravening, the model utilises 
a selection mechanism (e.g. an epistemic entrenchment relation over beliefs, a
nearness relation over worlds or a preference relation over theories) in order to 
determine what portion of the old belief state has to be discarded before the 
input is incorporated into it. 

From here onwards we will assume a finitary propositional object language
$\mathcal{L}$.⁴ Let its logic be represented by a classical logical consequence operation $Cn$. The
yielding relation $\vdash$ is defined via $Cn$ as: $\Gamma \vdash \alpha$ iff $\alpha \in Cn(\Gamma)$.

The AGM revision operation is required to satisfy the following rationality
postulates: Let $K$ be a belief set (a set of sentences closed under $Cn$), the sentence
$x \in \mathcal{L}$ be the evidence, $*$ the revision operator, and $K^*_x$ the result of revising $K$
by $x$.

(1*) $K^*_x$ is a theory
(2*) $x \in K^*_x$
(3*) $K^*_x \subseteq Cn(K \cup \{x\})$
(4*) If $K \nvdash \neg x$ then $Cn(K \cup \{x\}) \subseteq K^*_x$
(5*) $K^*_x = K_\perp$ iff $\vdash \neg x$
(6*) If $\vdash x \leftrightarrow y$, then $K^*_x = K^*_y$
(7*) $K^*_{x \wedge y} \subseteq Cn(K^*_x \cup \{y\})$
(8*) If $\neg y \notin K^*_x$ then $Cn(K^*_x \cup \{y\}) \subseteq K^*_{x \wedge y}$

Motivation for these postulates can be found in [3]. We call any revision opera- 
tion that satisfies the above eight constraints “AGM rational” . These postulates 
can actually be translated into constraints on a non-monotonic inference rela- 
tion |~ [5]. 



3.2 Semantics of Belief Change 

There are various constructions of an AGM rational revision operation. The one
we will present is equivalent to the construction via “Systems of Spheres” (SOS)
propounded by Adam Grove [7]. Let $\mathcal{M}$ be the class of maximally consistent
sets $w$ of sentences in the language in question. The reader is encouraged to think
of these maximal sets as worlds, models or scenarios. (We will use the following
expressions interchangeably: “$w \models \alpha$”, “$\alpha$ allows $w$” and “$w \in [\alpha]$”, where $w$ is
an element of $\mathcal{M}$ and $\alpha$ is either a sentence or a set of sentences.) Given the belief
set $K$, denote by $[K]$ the worlds allowed by it, i.e., $[K] = \{w \in \mathcal{M} \mid K \subseteq w\}$.
(Similarly, for any sentence $x$, let $[x]$ be the set of “worlds” in which $x$ holds.)

⁴ A finitary language is a language generated from a finite number of atomic sentences.
So the number of sentences in this language is not finite.

370 Abhaya C. Nayak and Norman Y. Foo

A system of spheres is simply represented by a connected, transitive and
reflexive relation (total preorder) $\sqsubseteq$ over the set $\mathcal{M}$ such that $[K]$ is exactly
the set of $\sqsubseteq$-minimal worlds of $\mathcal{M}$. Intuitively, $w \sqsubseteq w'$ may be read as: $w$ is
at least as good/preferable as $w'$ (or, $w'$ is not strictly preferred to $w$). We
define the Grove-revision function $G*$ as:
$[K^{G*}_x] = \{w \in [x] \mid w \sqsubseteq w' \text{ for all } w' \in [x]\}$,
whereby $K^{G*}_x = \bigcap [K^{G*}_x]$. It turns out that the AGM revision postulates
characterise the Grove revision operation $G*$.⁵
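
The Grove construction can be illustrated on a toy language with two atoms. In the sketch below, worlds are truth assignments and the total preorder is encoded by a hypothetical numeric rank (lower means more preferred); the atoms, the ranks and the function names are assumptions made for illustration only.

```python
from itertools import product

ATOMS = ("p", "q")
# A world is a truth assignment to the atoms, written as a tuple of booleans.
WORLDS = list(product([True, False], repeat=len(ATOMS)))

# Hypothetical ranks standing in for the total preorder: w is at least as
# good as w' exactly when RANK[w] <= RANK[w'].
RANK = {
    (True, True): 0,
    (True, False): 1,
    (False, True): 1,
    (False, False): 2,
}


def models(sentence):
    """Worlds in which the sentence (a predicate on a valuation) holds."""
    return {w for w in WORLDS if sentence(dict(zip(ATOMS, w)))}


def grove_revise(rank, x_worlds):
    """Return the minimal (most preferred) worlds among those allowed by x."""
    if not x_worlds:
        return set()
    best = min(rank[w] for w in x_worlds)
    return {w for w in x_worlds if rank[w] == best}


# [K] is the set of minimal worlds overall.
belief_worlds = grove_revise(RANK, set(WORLDS))
# Revising by "not p" keeps only the most preferred not-p worlds.
revised_worlds = grove_revise(RANK, models(lambda v: not v["p"]))
print(belief_worlds, revised_worlds)
```

The revised belief set itself would be the set of sentences true in every world of `revised_worlds`; the sketch stays at the level of worlds, which is where the choice is made.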

A visual representation of the crucial case in the Grove Construction is given 
in Figure 1. 




Fig. 1. Minimality Based revision - the principal case 



In this, the area marked $[x]$ represents the models allowed by the evidence $x$.
The area $[K]$ represents the models currently entertained by the agent, and the
broken circles demarcate models according to their perceived plausibility. The
farther a model is from the centre, the more implausible it is. The shaded part
of $[x]$ represents the least implausible of the models allowed by the evidence $x$ -
hence identified with $[K^{G*}_x]$.

Viewed from this semantic angle, belief change is about preferential choice:
$[K^{G*}_x]$ essentially identifies the subset to be chosen from $[x]$ as the set of worlds
that are $\sqsubseteq$-best in $[x]$.

We introduce the following notation for later use. 

Definition 1 A subset $T$ of $\mathcal{M}$ is said to be $\sqsubseteq$-flat just in case $w \sqsubseteq w'$ for
all members $w, w'$ of $T$. In this case, the members of $T$ are called $\sqsubseteq$-equivalent.
$w \sqsubset w'$, on the other hand, is used as an abbreviation for $(w \sqsubseteq w') \wedge (w' \not\sqsubseteq w)$.



⁵ Readers acquainted with Grove’s work will easily notice that given a system of
spheres $\Sigma$, the relation $\sqsubseteq_\Sigma$ can be generated as: $w \sqsubseteq_\Sigma w'$ iff for every sphere $S'$ that
has $w'$ as a member, there exists a sphere $S \subseteq S'$ with $w$ as a member. On the other
hand, given a total preorder $\sqsubseteq$ on $\mathcal{M}$, a system of spheres $\Sigma_\sqsubseteq$ can be generated as
follows: A set $S \subseteq \mathcal{M}$ is a sphere in $\Sigma_\sqsubseteq$ iff given any member $w$ of $S$, if $w' \sqsubseteq w$
then $w'$ is also a member of $S$. It is easily noticed that the $\sqsubseteq$-minimal worlds of $\mathcal{M}$
constitute the central sphere, and for any sentence $x$, the $\sqsubseteq$-minimal members of $[x]$
constitute $[K^{G*}_x]$ in the corresponding SOS.

3.3 Minimality Based Abduction

In section §3.2 we offered a constructive approach to belief change via a similarity
relation $\sqsubseteq$ among worlds. There is a well known alternative to this construction
based on a binary relation over sentential beliefs, known as “epistemic
entrenchment” [4]. This relation may be viewed as ranking the beliefs based on their
comparative strengths of acceptance. Alternatively, a constructive approach to
belief change can also be based on the pair-wise comparison of the disbeliefs
with respect to their strength of denouncement [2]. None of these approaches
allows any nontrivial comparison among plausible hypotheses that have neither
been accepted nor rejected by the agent (henceforth plausibilities). A case can be
made, however, that these plausibilities, namely the hypotheses that the agent
has suspended judgement about, can be meaningfully compared with respect to
their plausibility. After all, the whole Bayesian tradition is based on the probabilistic
comparison of such plausibilities!

If we grant that plausibilities can be meaningfully compared with each other,
it has an interesting spin-off with respect to the Grovian Systems of Spheres.
Let us say, for a start, that of two plausibilities $x$ and $y$, the former is more
plausible iff some $x$-validating scenario is preferable (closer to the reality) to
every $y$-validating scenario. However, since $x$ and $y$ are plausibilities, the most
preferred $x$-validating and $y$-validating worlds are members of $[K]$, and $[K]$ is
$\sqsubseteq$-flat! So in order to allow meaningful comparison of plausibilities, we have
to supplement the Grovian measure (primarily over $\mathcal{M} \setminus [K]$) with a measure
over $[K]$. That is precisely what Pagnucco does in [10] in order to offer us an
account of abduction.

Pagnucco effectively ignores (with good reason) the extra-$[K]$ system of
spheres and introduces an intra-$[K]$ system of spheres and examines the
consequences of adopting a minimality-based belief change operation with respect
to the latter. The result is not belief revision proper since the pieces of evidence
that are of interest here are not disbeliefs but plausibilities, and hence are
consistent with the current knowledge. Since the result is in general stronger than
classical expansion, it is closest to what has been called abduction or inference to
the best explanation in the literature [11]. Figure 2 provides a visual
representation of the abductive process suggested in [10].

Pagnucco has examined the properties of this abduction operation. Let $K$ be
the current belief set, $x$ the evidence and $+$ the abductive expansion operation.
The following list fully characterises this operation.

(1+) $K^+_x$ is a theory
(2+) If $\neg x \notin K$ then $x \in K^+_x$
(3+) $K \subseteq K^+_x$
(4+) If $K \vdash \neg x$ then $K^+_x = K$
(5+) If $K \nvdash \neg x$ then $\neg x \notin K^+_x$
(6+) If $K \vdash x \leftrightarrow y$, then $K^+_x = K^+_y$
(7+) $K^+_x \subseteq Cn(K^+_{x \vee y} \cup \{x\})$
(8+) If $\neg x \notin K^+_{x \vee y}$, then $K^+_{x \vee y} \subseteq K^+_x$

The motivation behind these properties can be found in [10]. 

Fig. 2. Minimality based Abduction (regions labelled $[K]$, $[x]$ and $[K^+_x]$)



3.4 Failure of Minimality Based Abduction 

Despite its innovative approach, Pagnucco’s suggestion succumbs to a serious
problem. It has been recognised early on that any belief change operation should
satisfy the property of category matching: the object that undergoes change
must result in an object of the same category. Without this property, there is
no guarantee that the resultant object can face up to another change. This was
a major problem with the classical AGM approach to belief change; extensions
of this approach avoid this myopic problem [8]. However, Pagnucco’s approach
has not addressed this issue. In it, a structured object ($[K]$) undergoes an
epistemic change and results in an unstructured object ($[K^+_x]$) which, in turn,
cannot handle further abductive change.

In section §2 we outlined some features in presence of which the “reject the 
worst” principle is more appropriate than the “select the best” principle. It is 
easily verified that the context of abductive belief change has all these features. 
Hence we suggest that we give up the “choose the best” principle in the context of
abduction and adopt the “reject the worst” principle instead.



4 Abduction by Rejecting the Worst 

In the case of abduction, the crucial test is what happens when the evidence $x$
is consistent with the current knowledge $K$. Accordingly, we will pretend that
$\mathcal{M} \setminus [K]$ is $\sqsubseteq$-flat, although $[K]$ itself is, in general, not $\sqsubseteq$-flat. This assumption is
granted in Pagnucco’s account as well, and is similar in spirit to the assumption
in [9] that $[K]$ is $\sqsubseteq$-flat. For convenience, we will assume that $K$ is consistent.

When the evidence is inconsistent with $K$, it is a boundary case, and it does
not really matter how we deal with the boundary case. In this case, Pagnucco
disallows any change in the current knowledge. On the other hand, we will stick
to the classical AGM approach and assume that in this case the resultant state
is inconsistent. (This accords well with the “reject the worst” principle - assuming
that the worlds outside $[K]$ are all equally preferred, they are all rejected; so the
resultant set is empty.)

Suppose now that the evidence $x$ is consistent with the current knowledge $K$,
namely, $[K] \cap [x] \neq \emptyset$. Then the choice set in question is the set of worlds
$[K] \cap [x]$. If not all members of this set are equally preferred (or dispreferred),
then according to the “reject the worst” principle, at least one member of this set
will be rejected, and the rest will be returned as $[K^+_x]$.⁶

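Accordingly, given an appropriate total preorder $\sqsubseteq$ on $\mathcal{M}$ for a belief set $K$
we define the non-minimal abduction operation $\oplus_\sqsubseteq$ (the subscript is henceforth
dropped for readability except when the context is confusing) as follows:

Definition 2 (from $\sqsubseteq$ to $\oplus$) Where $\sqsubseteq$ is a total preorder on $\mathcal{M}$ and $K$ a belief set
such that $[K] = \{w \mid w \sqsubset w' \text{ for some } w' \in \mathcal{M}\}$,

$$
[K^{\oplus}_x] =
\begin{cases}
\emptyset & \text{if } [K] \cap [x] = \emptyset \\
[K] \cap [x] & \text{else if } [K] \cap [x] \text{ is } \sqsubseteq\text{-flat} \\
\{w \in [K] \cap [x] \mid w \sqsubset w' \text{ for some } w' \in [K] \cap [x]\} & \text{otherwise.}
\end{cases}
$$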



This definition separates three distinct ways of processing the evidence, as
pictured in Figure 3. The first case is represented by $[z]$. In this case all the models
in $[z]$ are eliminated. The area $[y]$ represents the second case. Here, since the
models in $[K] \cap [y]$ cannot be discriminated on the basis of $\sqsubseteq$ alone, none of
them is eliminated. The principal case, namely the third case, is represented by
$[x]$. Here, among the models provided by $[K] \cap [x]$, the most implausible ones
are eliminated and the rest are retained, perhaps, for future scrutiny.
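
The three cases of Definition 2 translate directly into a small function over a finite set of worlds. This is a minimal sketch, assuming the total preorder is given as a hypothetical numeric rank (lower means more preferred) and that evidence is represented by the set of worlds it allows; the world names and ranks below are invented for illustration.

```python
# Hypothetical ranks: worlds "e" and "f" are the maximal (worst) ones, so
# they fall outside [K]; within [K] the worlds are not all equally ranked.
RANK = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 3, "f": 3}


def k_worlds(rank):
    """[K]: the worlds strictly better than some world, i.e. non-maximal."""
    top = max(rank.values())
    return {w for w, r in rank.items() if r < top}


def abduce(rank, x_worlds):
    """Worlds of the abductive expansion by x, following the three cases."""
    choice = k_worlds(rank) & set(x_worlds)
    if not choice:
        return set()              # evidence conflicts with K
    ranks = {rank[w] for w in choice}
    if len(ranks) == 1:
        return set(choice)        # flat choice set: nothing is rejected
    worst = max(ranks)
    return {w for w in choice if rank[w] < worst}  # reject only the worst


print(abduce(RANK, {"a", "b", "d", "e"}))  # principal case, like [x]
print(abduce(RANK, {"b", "c", "f"}))       # flat case, like [y]
print(abduce(RANK, {"e", "f"}))            # conflicting case, like [z]
```

A world is strictly better than some member of the choice set exactly when its rank is below the worst rank found there, which is why the third case reduces to one comparison.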




Fig. 3. Abduction Without Minimality (regions labelled $[K]$, $[x]$, $[y]$ and $[z]$)



⁶ We should not use the symbol $+$ here, since we have already used it to denote
Pagnucco’s abductive expansion. Later on we use a more appropriate symbol $\oplus$.
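
How good is $\oplus$, as defined above, as an abduction operation? We suggest that
any abduction operation must have the following basic properties:

(1$\oplus$) $K^{\oplus}_x$ is a theory
(2$\oplus$) $x \in K^{\oplus}_x$
(3$\oplus$) $K \subseteq K^{\oplus}_x$
(4$\oplus$) If $K \nvdash \neg x$ then $K^{\oplus}_x \nvdash \neg x$
(5$\oplus$) If $K \nvdash \perp$ then $K^{\oplus}_x = K_\perp$ iff $K \vdash \neg x$
(6$\oplus$) If $K \vdash x \leftrightarrow y$, then $K^{\oplus}_x = K^{\oplus}_y$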


The first three of these properties are obvious requirements for any expansion
operation, abductive or otherwise. Given (2$\oplus$), the fourth condition says that
evidence consistent with the current knowledge cannot introduce inconsistency
into one’s body of knowledge. The fifth property says that an abductive
process results in an inconsistent body of knowledge exactly when the evidence in
question conflicts with the current knowledge. The sixth property says that the
syntactic representation of evidence is irrelevant to the result of an abductive
expansion, modulo the consequences of the current knowledge.

It is interesting to compare these basic properties with the first six
postulates proposed by Pagnucco [10]. The basic difference is based on the difference
between the corresponding second properties. Unlike (2$\oplus$), Pagnucco’s Success
postulate is conditional upon the evidence being consistent with the current
knowledge. When the evidence is inconsistent with the current knowledge,
expansion is a boundary case, and how it is handled should not be given much
importance. Accordingly, while we retain the AGM property of Success at the
cost of allowing possible expansion into inconsistency, Pagnucco avoids such silly
expansion at the cost of losing Success. This explains the difference between the
fourth properties of abduction in the two systems.

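As will be reported in section §5, our abduction operation $\oplus$, apart from
satisfying these basic postulates, also satisfies the following five supplementary
postulates for the abductive expansion.

(7.1$\oplus$) If $K^{\oplus}_x \not\subseteq Cn(K \cup \{x, y\})$ then $K^{\oplus}_{x \wedge y} \subseteq Cn(K^{\oplus}_x \cup \{y\})$
(7.2$\oplus$) If $K^{\oplus}_y = Cn(K \cup \{y\})$ then $K^{\oplus}_{x \wedge y} \subseteq Cn(K^{\oplus}_x \cup \{y\})$
(7.3$\oplus$) If $K^{\oplus}_x \cap Cn(K \cup \{y\}) \subseteq Cn(K \cup \{x\})$
then $K^{\oplus}_{x \wedge y} \subseteq Cn(K^{\oplus}_x \cup \{y\})$
(8$\oplus$) If $K^{\oplus}_x \nvdash \neg y$ then $Cn(K^{\oplus}_x \cup \{y\}) \subseteq K^{\oplus}_{x \wedge y}$
(9$\oplus$) If $K \nvdash \neg x$, $K^{\oplus}_x \vdash \neg y$ but $K \cup \{x\} \nvdash \neg y$
then $K^{\oplus}_{x \wedge y} = Cn(K \cup \{x, y\})$.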

Postulates (7.1$\oplus$)-(7.3$\oplus$) tell us under what condition a piece of evidence $y$
loses its inferential power in presence of another piece of evidence $x$. For instance,
(7.1$\oplus$) may be paraphrased as follows: Given some background knowledge $K$,
if $x$ is able to explain something that cannot be classically inferred from $x$ and $y$
together, then there is nothing that does not classically follow from $y$ in
presence of what $x$ alone explains, and yet is jointly explainable by $x$ and $y$ together.
Postulate (8$\oplus$) on the other hand says that $x$ and $y$ jointly fail to explain
something that follows from $y$ in presence of what is explainable by $x$ only if $y$ conflicts
with something that is explained by $x$. Finally, postulate (9$\oplus$) essentially says
that even though evidence $y$ does not conflict with $x$ (and the background
knowledge $K$), if $y$ conflicts with something explainable in terms of $x$, then $x$ and $y$
jointly have no abductive force.

One striking difference between our supplementary postulates and those
in [10] is that while in the latter constraints are sought for handling
disjunctive evidence (i.e., on $K^{+}_{x \vee y}$), in the current approach constraints are sought on
the result of processing conjunctive evidence (i.e., on $K^{\oplus}_{x \wedge y}$). The primary
reason for this is that we wanted the connection between the properties of $\circledast$ in [9]
and $\oplus$ to be made obvious. We however believe that the same effect can be
achieved by putting constraints on $K^{\oplus}_{x \vee y}$ as are achieved by constraining $K^{\oplus}_{x \wedge y}$
in the postulates (7.1$\oplus$)-(9$\oplus$).

5 Technical Results 

In section §4 we proposed and discussed a list of abduction postulates. It remains
to be seen whether the proposed postulates in fact capture the semantic intuition
behind Definition 2. The following results show that the postulates (1$\oplus$)-(9$\oplus$)
indeed characterise the operation in question. The proofs have been omitted due
to space constraints. Our first result, the soundness result, shows that the
expansion operation $\oplus$ defined from the total preorder $\sqsubseteq$ via Definition 2
in fact satisfies properties (1$\oplus$)-(9$\oplus$).

Theorem 1 Let $\sqsubseteq$ be a total preorder. Let the operation $\oplus = \oplus_\sqsubseteq$ be defined
from $\sqsubseteq$ in accordance with Definition 2. The operation $\oplus$ satisfies the properties
(1$\oplus$)-(9$\oplus$).

Next we show the completeness result to the effect that given an abductive
expansion operation $\oplus$ that satisfies (1$\oplus$)-(9$\oplus$) and a fixed belief set $K$ we can
construct a binary relation $\sqsubseteq_{\oplus,K}$ with the desired properties. (We will trade off
rigour against readability, and normally drop the subscripts.) In particular, we
will show that, where $\sqsubseteq$ is the relation so constructed: (1) $\sqsubseteq$ is a total preorder
over $\mathcal{M}$, (2) the SOS (System of Spheres) corresponding to $\sqsubseteq$ is a SOS whose
maximal elements are exactly the members of $\mathcal{M} \setminus [K]$ and (3) $K^{\oplus}_x = K^{\oplus_\sqsubseteq}_x$ for
any sentence $x$.

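Definition 3 (from $\square$ to $\sqsubseteq$) Given a revision operation $\square$ and a belief set $K$,
$w \sqsubseteq_{\square,K} w'$ iff either $w' \notin [K]$ or both $\{w, w'\} \subseteq [K]$ and $w \in [K^{\square}_x]$ whenever
$w' \in [K^{\square}_x]$, for every sentence $x$ such that $\{w, w'\} \subseteq [K] \cap [x]$.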



Theorem 2 Let $\square$ be a revision operation satisfying $(1\square)$-$(9\square)$ and $K$ a
belief set. Let $\sqsubseteq$ be generated from $\square$ and $K$ as prescribed by Definition 3. Then
$\sqsubseteq$ is a total preorder on $\mathcal{M}$ such that $[K]$ is the set of $\sqsubseteq$-non-maximal elements
of $\mathcal{M}$.

Theorems 1 and 2 jointly show that the postulates $(1\square)$-$(9\square)$ exactly
characterise the abductive expansion operation constructed from the $\sqsubseteq$ relation.

Furthermore, the total preorder $\sqsubseteq$ constructed from a given non-minimal
revision operation $\square$ and belief set $K$ is the desired $\sqsubseteq$ in the sense that the
non-minimal revision operation constructed from it, in turn, behaves like the
original operation $\square$ with respect to the belief set $K$.

Theorem 3 Let $\square$ be a non-minimal belief revision operator satisfying postulates
$(1\square)$-$(9\square)$ and $K$ be an arbitrary belief set. Let $\sqsubseteq$ be defined from $\square$
and $K$ in accordance with Definition 3. Let $\square' = \square_{\sqsubseteq}$ be defined from $\sqsubseteq$, in
turn, via Definition 2. Then for any sentence $x$ (and the originally fixed belief
set $K$) it holds that $K^{\square}_x = K^{\square'}_x$.

Conversely, if one starts with a total preorder $\sqsubseteq$, constructs a revision
operation $\square$ from it via Definition 2 and then constructs a total preorder $\sqsubseteq'$ from
that $\square$ in turn via Definition 3, then one gets back the original relation $\sqsubseteq$.

Theorem 4 Let $\sqsubseteq$ be a total preorder on $\mathcal{M}$ and $[K]$ the set of $\sqsubseteq$-non-maximal
members of $\mathcal{M}$. Let $\square$ be defined (for $K$) from $\sqsubseteq$ via Definition 2. Let $\sqsubseteq' = \sqsubseteq_{\square,K}$
be defined from $\square$, in turn, via Definition 3. Then $w \sqsubseteq w'$ iff $w \sqsubseteq' w'$ for any
two worlds $w, w' \in \mathcal{M}$.

6 Discussion: Iterated Abduction 

In this paper, we examined the consequences of adopting the reject worst prin- 
ciple in the context of abductive belief change. This was done with the intent 
of extending a recently proposed account of abduction in [10] so that iterative 
abduction can be accommodated in the resulting framework. In the account
of abduction provided in [10], there is no room for iterated abduction. This is because
Pagnucco’s minimality based abduction relies on the degree of plausibility (of 
sentences that are at the time neither believed nor disbelieved) but provides no 
such measure in the resultant belief state. Graphically speaking, (see Figure 2) 
[K'^], the candidate for [K] in the next generation, is devoid of any structure, 
making it impossible to generate a measure of plausibility.^ So, after one round 
of abduction, this account will reduce to classical AGM expansion. 

The account of abduction provided in this paper addresses this shortcoming.
In general, there is enough structure left in $[K^{\square}_x]$ making further rounds of
abduction possible. Of course, after each round of abduction, less structure is
left in the new $[K]$, and eventually it will flatten out. Thus, it would appear
as if our account merely postpones the problem of iterated abduction. Such a 
conclusion, however, is rather premature. Often, the evidence we handle is not 
consistent with our current knowledge. In such a case, it is understood that the 
agent should use a revision operation (instead of an expansion operation). And if 
the agent uses such a revision operation, although currently the plausibility
measure is flat, the revision is likely to inject structure into one’s plausibility
measure. A rigorous presentation of this material is beyond the scope of this
paper, and is the subject of a different work.

^ Strictly speaking, it will generate a flat plausibility measure.
Abstract. A unifying semantic framework for different reasoning approaches
provides an ideal tool to compare these competing alternatives.
A historic example is Kripke’s possible world semantics that provided
a unifying framework for different systems of modal logic. More
recently, Shoham’s work on preferential semantics similarly provided a
much needed framework to uniformly represent and compare a variety of
nonmonotonic logics (including some logics of action). The present work
develops a novel type of semantics for a particular causal approach to
reasoning about action. The basic idea is to abandon the standard state-space
of possible worlds and consider instead a larger set of possibilities
(a hyper-space), tracing the effects of actions (including indirect effects)
with the states in the hyper-space. Intuitively, the purpose of these
hyper-states is to supply extra context to record the process of causality.

Keywords: common-sense reasoning, nonmonotonic reasoning, temporal
reasoning.



In recent artificial intelligence research into reasoning about action much at- 
tention has been focussed on the role of causality [8,13]. While there is significant 
consensus that a causal component to reasoning systems is not explicitly nec- 
essary to solve the frame and ramification problems, it is generally considered 
necessary for concise solutions to these problems. 

Causal theories of action have become prominent in a proliferation of rea- 
soning about action frameworks. Each of these frameworks is couched in its 
own syntax and calculus for providing solutions to the frame and ramifications 
problems. But a cursory glance at this situation is sufficient to clearly indicate 
that this is grossly inadequate. What is required is an independent semantic 
motivation for these various proposed frameworks. 

A unifying semantics would provide a basis upon which to compare the myr- 
iad approaches to reasoning about action. Moreover, it would give a clearer 
insight into the nature of causality underlying these various frameworks. While 
the prospect of a unifying semantics is a bit too ambitious for the present work 
we hope that some of the morals drawn may be able to serve as a first step in 
this direction. 

One landmark proposal in the early literature on reasoning about action
was Shoham’s [12] preferential semantics. This semantics provided insight into
several areas of reasoning in artificial intelligence including belief change [1]
and nonmonotonic reasoning [5]. Recently Peppas et al. [9] have shown that it
is not possible to furnish a traditional preferential style semantics for a recent
causal approach to reasoning about action, namely McCain and Turner’s causal
theory of action [8]. They provided an augmented preferential semantics capable of
characterising this framework. Subsequent to McCain and Turner’s framework
Thielscher [13] has proposed a causal approach to reasoning about action which,
under certain specific conditions, subsumes McCain and Turner’s approach.^
However, this framework is devoid of a suitable semantics. As a result, it is
difficult to place this framework in perspective with competing proposals.

N. Foo (Ed.): AI’99, LNAI 1747, pp. 378-392, 1999.
© Springer-Verlag Berlin Heidelberg 1999

Put briefly, the main aim of this paper is to furnish a semantics characterising
Thielscher’s causal theory of action. We do so by proffering a novel type of
semantics. The basic idea is to abandon the standard state-space of possible
worlds and consider instead a larger set of possibilities (a hyper-space),
tracing the effects of actions (including indirect effects) with the states in the
hyper-space. Intuitively, the purpose of these hyper-states is to supply extra
context to record the process of causality.

In the following section we outline the necessary technical preliminaries for 
an understanding of this paper. In section 2 we briefly sketch Thielscher’s causal 
theory of action. In Section 3 we introduce the hyper-space semantics that we 
shall use to characterise Thielscher’s [13] approach. Section 4 will establish the 
necessary representation theorems. Section 5 discusses the importance of these 
results. 

1 Technical Preliminaries

Let $\mathcal{F}$ be a finite set of symbols from a fixed language $B$, called fluent names.
A fluent literal is either a fluent name $f \in \mathcal{F}$ or its negation, denoted by $\neg f$.
Let $L_{\mathcal{F}}$ be the set of all fluent literals defined over the set of fluent names $\mathcal{F}$.
We will adopt from Thielscher [13] the following notation. If $e \in L_{\mathcal{F}}$ then $|e|$
denotes its affirmative component, that is, $|f| = |\neg f| = f$, where $f \in \mathcal{F}$. This
notation can be extended to sets of fluent literals as follows: $|S| = \{|f| : f \in S\}$.
By state we intend a maximal consistent set of fluent literals. We will denote
the set of all states as $W$, and call the number $m$ of fluent names in $\mathcal{F}$ the
dimension of $W$. By $[\phi]$ we denote all states consistent with the sentence $\phi \in B$
(i.e., $[\phi] = \{w \in W : w \vdash \phi\}$).

2 Background 

The idea of minimising change in order to deduce the set of possible next states
(successor states) is used quite broadly in action theories. Sometimes the notion
of minimal change is defined by set inclusion (e.g., PMA) [14,4,7,8], and often
incorporates the frame concept or the policy of categorisation [4,7], assigning
different degrees of inertia to language elements (fluents, literals, formulas, etc.).
Shortcomings of particular implementations of the principle of minimal change
and the policy of categorisation are well-known: imprecise or capricious
definitions of minimality metrics (e.g., PWA [2] vs PMA [14]), difficulties in properly
categorising fluents as inertial and non-inertial (leading to increasingly complex
selection mechanisms of action languages [7,10,13]), etc. These problems have
generated attempts to use some notion of causality instead of or in addition to
the principle of minimal change. For instance, some action theories try to embody
background information in the form of domain “causal laws”, pointing to
the fact that, in general, propositions embracing causal dependencies are more
expressive than traditional state constraints [6,8].

^ There is insufficient space for us to elaborate upon McCain and Turner’s approach
here and to furnish a comparison with Thielscher’s alternate proposal.

However, despite numerous attempts to combine a notion of causality with 
the principle of minimal change and/or policy of categorisation, multiple counter- 
examples keep reappearing, highlighting the intractability of the ramification 
problem. The framework suggested by Thielscher [13] criticised the categorisa- 
tion policy and the principle of minimal change, arguing for the necessity of an 
approach based on causality. Thielscher’s approach was intended to provide a 
method to avoid unintuitive indirect effects (ramifications), while accounting for 
causal relationships of a domain in hand. One of the perceived strengths of the 
Thielscher approach was an ability to capture not only all intuitively expected 
resulting states with minimal distance to the initial state, but also non-minimal 
solutions - “perfectly acceptable provided all changes are reasonable from the 
standpoint of causality” [13]. In other words, the non-minimal solutions are those 
states which are reachable via causal propagation from an intermediate state. 
This intermediate state is determined as the nearest state to the initial state, 
where the direct action effects hold, while some domain constraints may be vio- 
lated. 

Thielscher [13] criticised minimal change on the grounds that, in his view, it 
rejects a potential resultant state if it is obtained by changing the values of more 
fluents than strictly necessary. Arguably, this skewed view of the principle is too 
restrictive to warrant complete abandonment of the general notion of minimal 
change. In this paper we question Thielscher’s criticism of minimal change and 
contend that there is an element of minimal change at work in his framework. 
To demonstrate our claim we exhibit a semantics for Thielscher’s causal theory
of actions. This semantics can be clearly seen to employ a component of minimal 
change coupled with causality. 

Thielscher employs two crucial notions: action laws and causal relationships. 
Action laws essentially describe the immediate (or direct) effects of performing 
an action in a given state. Causal relationships are responsible for producing the 
indirect effects of actions. 

Thielscher employs the following notion of action specification. Each action 
law consists of: 

- a condition C, which is a set of fluent literals, all of which must be contained
  in an initial state where the action is intended to be applied;
- a (direct) effect E, which is also a set of fluent literals, all of which must
  hold in the resulting state after having applied the action.

For simplicity, it is assumed that condition and effect are constructed from 
the very same set of fluent names. Therefore, the state resulting from a direct 
effect is obtained by simply removing set C from the initial state at hand and 
adding set E to it. However, execution of an action may cause further state 
transitions. 

Definition 1. Let $\mathcal{F}$ be the set of fluent names and let $\mathcal{A}$ be a finite set of
symbols called action names, such that $\mathcal{F} \cap \mathcal{A} = \emptyset$. An action law is a triple
$\langle C, a, E \rangle$ where $C$, called condition, and $E$, called effect, are individually
consistent sets of fluent literals, composed of the very same set of fluent names
(i.e., $|C| = |E|$) and $a \in \mathcal{A}$. If $w$ is a state then an action law $\alpha = \langle C, a, E \rangle$ is
applicable in $w$ iff $C \subseteq w$. The application of $\alpha$ to $w$ yields the state
$(w \setminus C) \cup E$, where $\setminus$ denotes set subtraction.
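As a minimal sketch of Definition 1, the applicability test and the direct-effect update can be written with plain set operations. The string encoding of literals (`"b"` for a fluent name, `"-b"` for its negation) and the function names are assumptions made for illustration only; they do not appear in the paper.

```python
def affirmative(lit):
    """Affirmative component |lit|: strip a leading negation sign, if any."""
    return lit[1:] if lit.startswith("-") else lit


def applicable(condition, state):
    """An action law <C, a, E> is applicable in w iff C is a subset of w."""
    return condition <= state


def apply_law(condition, effect, state):
    """Return (w \\ C) | E, the state obtained from the direct effect."""
    # Definition 1 requires C and E to be built from the same fluent names.
    if {affirmative(l) for l in condition} != {affirmative(l) for l in effect}:
        raise ValueError("condition and effect must mention the same fluent names")
    if not applicable(condition, state):
        raise ValueError("action law is not applicable in this state")
    return (state - condition) | effect


# Hypothetical law <{b}, x, {-b}> applied at the state {a, b, c}.
w = frozenset({"a", "b", "c"})
intermediate = apply_law(frozenset({"b"}), frozenset({"-b"}), w)
```

Because the condition and the effect range over the same fluent names, removing C and adding E always yields a maximal consistent set again.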

Thielscher’s approach formally incorporates causal information through
causal relationships of the form

$\varepsilon$ causes $\varrho$ if $\Phi$

where $\varepsilon$ and $\varrho$ are fluent literals and $\Phi$ is a fluent formula based on $\mathcal{F}$, the set of
fluent names.

Definition 2. Let $(s, E)$ be a pair consisting of a state $s$ and a set of fluent
literals $E$. Then a causal relationship $\varepsilon$ causes $\varrho$ if $\Phi$ is applicable to $(s, E)$ iff
$\Phi \wedge \neg\varrho$ is true in $s$, and $\varepsilon \in E$. Its application yields the pair $(s', E')$, denoted as
$(s, E) \leadsto (s', E')$, where $s' = (s \setminus \{\neg\varrho\}) \cup \{\varrho\}$ and $E' = (E \setminus \{\neg\varrho\}) \cup \{\varrho\}$.

Intuitively, a causal relationship is applicable if the associated condition $\Phi$
holds, the particular indirect effect $\varrho$ is currently false, and its cause $\varepsilon$ is among
the current effects. In other words, the cause has been effected, i.e., it has
changed during causal propagation from false in the past to being true at the
moment. Importantly, if the literal $\varepsilon$ is not among current effects, then it is not
possible to apply the causal relationship, even if $\varepsilon$ is an element of a current
state.
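The applicability condition of Definition 2 can be illustrated with a short sketch. It assumes a hypothetical encoding in which literals are strings (`"-a"` negates `"a"`) and the qualifying formula is supplied as a Python predicate over states; neither choice comes from the paper.

```python
from typing import Callable, FrozenSet, NamedTuple


def neg(lit):
    """Complement of a literal in the assumed string encoding."""
    return lit[1:] if lit.startswith("-") else "-" + lit


class CausalRelationship(NamedTuple):
    cause: str                                   # epsilon
    effect: str                                  # rho
    context: Callable[[FrozenSet[str]], bool]    # Phi, as a predicate on states


def is_applicable(rel, state, effects):
    """Phi and not-rho hold in the state, and the cause is a current effect."""
    return rel.context(state) and neg(rel.effect) in state and rel.cause in effects


def apply_relationship(rel, state, effects):
    """Replace not-rho by rho in both components of the pair (s, E)."""
    def flip(lits):
        return (lits - {neg(rel.effect)}) | {rel.effect}
    return flip(state), flip(effects)


# "-b causes -a if True", with the pair ({a, -b, c}, {-b}).
rel = CausalRelationship("-b", "-a", lambda s: True)
s = frozenset({"a", "-b", "c"})
pair = apply_relationship(rel, s, frozenset({"-b"}))
```

Note that `is_applicable(rel, s, frozenset())` is false even though `-b` belongs to the state: membership of the cause in the state alone is not enough.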

States incorporating direct action effects may violate the underlying domain 
constraints.^ So, “in order to obtain a satisfactory resulting state, we compute 
additional, indirect effects by (nondeterministically) selecting and (serially) ap- 
plying causal relationships. If this procedure eventually results in a state satis- 
fying the domain constraints, then this is called a successor state” ([13]). More 
precisely, the set of possible successor states $Res_T(a, w)$, given an initial state $w$
and an action a, is determined as follows. 

^ The details of an algorithm translating domain constraints and the influence relation 
into causal relationships are described in [13]. 
Definition 3. Let $\mathcal{F}$ be the set of fluent names, $\mathcal{A}$ a set of action names, $\mathcal{L}$ a set
of action laws, $\mathcal{D}$ a set of domain constraints, and $\mathcal{R}$ a set of causal relationships.
Furthermore, let $w$ be a state satisfying $\mathcal{D}$ and let $a \in \mathcal{A}$ be an action name. A
state $r$ is a successor state of $w$ and $a$, $r \in Res_T(a, w)$, iff there exists an
applicable (with respect to $w$) action law $\alpha = \langle C, a, E \rangle \in \mathcal{L}$ such that

1. $((w \setminus C) \cup E, E) \leadsto^{*} (r, E')$ for some $E'$, and

2. $r$ satisfies $\mathcal{D}$,

where $\leadsto^{*}$ denotes the transitive closure of $\leadsto$.

As mentioned before, an occurrence of a literal $\varepsilon$ in a state $s$ does not guarantee
that a causal relationship $\varepsilon$ causes $\varrho$ if $\Phi$ is applicable to a pair $(s, E)$;
to ensure applicability, the literal $\varepsilon$ has to belong to the current effects $E$. It is
interesting to note, however, that given a transition pair $(s, E)$, if the literal $\varepsilon$ is
among current effects $E$, then it must be an element of the current state $s$. This
observation can be formalised as follows.

Lemma 1. If $(s', E') \leadsto (s'', E'')$, then $E'' \subseteq s''$.

It is easy to observe that the set $E'$ contains the most recent consistent effects
that have taken place during the causal propagation $((w \setminus C) \cup E, E) \leadsto^{*} (r, E')$.
In other words, although some of the effects may have been retracted from the
effects set during propagation, their negations should have taken the respective
places. The effects set is intended to account for both direct and indirect changes.
However, it is not guaranteed that direct effects $E$ are always preserved by the
propagation. On the contrary, they can be lost (the indirect effects can be lost
as well, but this obviously is less counter-intuitive).

Consider, for example, the simple action system with $\mathcal{F} = \{p, q\}$, $\mathcal{D} = \{\neg q \rightarrow \neg p\}$,
$\mathcal{R} = \{\neg q$ causes $\neg p$ if $\top\}$, and $\mathcal{L} = \{\langle \{p, q\}, a, \{p, \neg q\} \rangle\}$. The action $a$
performed at the initial state $\{p, q\}$ results in a state $\{p, \neg q\}$. Clearly, this
resultant state does not satisfy the domain constraint. The causal relationship is
then applicable, whereby $(\{p, \neg q\}, \{p, \neg q\}) \leadsto (\{\neg p, \neg q\}, \{\neg p, \neg q\})$ and produces
$Res_T(a, \{p, q\}) = \{\{\neg p, \neg q\}\}$, where the successor state satisfies $\mathcal{D}$, while leaving
one of the direct effects ($p$) out.
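The example above can be reproduced by a small search over propagation pairs. The sketch below is illustrative only: the string encoding of literals, the representation of formulas and domain constraints as Python predicates, and the function name `successor_states` are assumptions. It also assumes that zero propagation steps are allowed, so an intermediate state that already satisfies the constraints counts as a successor.

```python
from collections import deque


def neg(lit):
    """Complement of a literal in the assumed string encoding ("p" vs "-p")."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def successor_states(w, action, laws, relationships, constraints):
    """Collect every constraint-satisfying state reachable by causal propagation."""
    results = set()
    for condition, name, effect in laws:
        if name != action or not condition <= w:
            continue  # law belongs to another action or is not applicable in w
        start = ((w - condition) | effect, effect)
        seen, queue = {start}, deque([start])
        while queue:
            s, e = queue.popleft()
            if all(holds(s) for holds in constraints):
                results.add(s)
            for cause, rho, phi in relationships:
                if phi(s) and neg(rho) in s and cause in e:
                    nxt = ((s - {neg(rho)}) | {rho}, (e - {neg(rho)}) | {rho})
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return results


laws = [(frozenset({"p", "q"}), "a", frozenset({"p", "-q"}))]
relationships = [("-q", "-p", lambda s: True)]        # -q causes -p if True
constraints = [lambda s: "-q" not in s or "-p" in s]  # -q -> -p
res = successor_states(frozenset({"p", "q"}), "a", laws, relationships, constraints)
```

The search terminates because there are only finitely many (state, effects) pairs, and each pair is queued at most once.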

We can strengthen the concept of successor states to conservative successor
states (denoted $Res_T^{c}(a, w)$) as follows.

Definition 4. Let $\mathcal{F}$, $\mathcal{A}$, $\mathcal{L}$, $\mathcal{D}$, $\mathcal{R}$, $w$, $\alpha = \langle C, a, E \rangle$ be the same as in
Definition 3. A state $r$ is a conservative successor state of $w$ and $a$, $r \in Res_T^{c}(a, w)$,
iff

1. $r \in Res_T(a, w)$, and

2. $E \subseteq r$.

This definition allows the causal propagation to “travel” outside the
$[E]$-states, but mandates that it finish in a state consistent with the direct
effects $E$.
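Given a set of successor states, the conservative ones are obtained by a simple filter. This is a sketch under the same assumed string encoding of literals as before; the function name is hypothetical and the successor set is entered by hand from the preceding example.

```python
def conservative_successors(successors, direct_effect):
    """Keep only the successor states that still contain every direct effect."""
    return {r for r in successors if direct_effect <= r}


# Successor set of the {p, q} example, whose direct effect is {p, -q}.
res = {frozenset({"-p", "-q"})}
conservative = conservative_successors(res, frozenset({"p", "-q"}))
```

For that example the filter returns the empty set, since the only successor state has lost the direct effect p.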
3 Hyper-space Semantics

It has been previously argued [9] that a preferential structure with a binary 
relation on states demonstrates that minimal change and causality — the for- 
mer captured by preferential semantics and the latter by a binary relation — 
together are essential in furnishing a concise solution to the frame problem. Our 
approach here is intended to illustrate this idea once more, now with respect to 
Thielscher’s causal theory of action. More importantly, it is our contention that 
a pure preferential semantics, in the spirit of [12], cannot be obtained for causal 
action systems without extending the underlying language. Thus, in addition, 
the proposed approach may serve as another step towards a uniform preferential 
semantics for (extended) causal action systems. 

Our intention at this stage is to consider a formalisation of action systems 
which faithfully captures all successor states, as defined by $Res_T(a, w)$ (or
$Res_T^{c}(a, w)$), using a simpler selection mechanism. More precisely, instead of
keeping an explicit (and changing) account of context-dependent action effects, 
we would like to use a binary (causal) relation on states. The advantage of this 
proposal is that a causal relation would be action-independent, unlike a history 
of effects. Obviously, this objective is hardly achievable without extending the 
action system components in some way. 

Let us begin by informally describing the semantics we develop, before pro- 
ceeding to establish the formal results. An expansion of the standard state- 
space to a hyper-space of a larger dimension generates numerous hyper-states. 
Any state in the standard state-space can then be associated with a number 
of hyper-states, creating a hyper-neighbourhood. For instance, an intermediate 
state (defined, for a given action and an initial state, according to Thielscher’s 
approach) can be represented by a set of hyper-states in the expanded space. 
This hyper-neighbourhood will be a starting point of a propagation. An appro- 
priately constructed binary relation on hyper-states would allow us to propagate 
in the hyper-space in a very simple way — without the necessity to track the 
causal history, and resulting in a clearly defined “final” set of hyper-states. A pro- 
jection from the resulting hyper-neighbourhood back to the normal state-space 
would pinpoint the desired successor state of the action at hand. Intuitively, the 
purpose of the hyper-states is to serve as possible causal extensions of normal 
states, providing necessary context to the process of causal propagation. In the 
remainder of this section we give a formal description of this semantics. 

We suggest to extend the set of fluent names $\mathcal{F}$ and incorporate more causal
information in states themselves rather than rely on context-dependent causal
propagation. We begin with definitions of an extended (hyper-) state. First of
all, we consider a set, denoted as $\mathring{\mathcal{F}}$, of the same cardinality as the set $\mathcal{F}$, such
that $\mathring{\mathcal{F}} \cap \mathcal{F} = \emptyset$. Then we define a function $j : \mathcal{F} \to \mathring{\mathcal{F}}$. Intuitively, the
element $j(f)$ of the set $\mathring{\mathcal{F}}$ is an extra space-dimension, corresponding to the
fluent $f \in \mathcal{F}$. Now let us consider the set $\mathring{L}_{\mathcal{F}} = \mathring{\mathcal{F}} \cup \{\neg q : q \in \mathring{\mathcal{F}}\}$. Clearly,
the cardinality of the set $\mathring{L}_{\mathcal{F}}$ is equal to the cardinality of the set of fluent
literals $L_{\mathcal{F}}$, and $\mathring{L}_{\mathcal{F}} \cap L_{\mathcal{F}} = \emptyset$. Another function is needed to map from $L_{\mathcal{F}}$ to
$\mathring{L}_{\mathcal{F}}$, and we introduce the function $l : L_{\mathcal{F}} \to \mathring{L}_{\mathcal{F}}$, such that $l(f) = j(f)$ if $f \in \mathcal{F}$
($f$ is a positive literal, i.e., a fluent name), and $l(f) = \neg j(|f|)$ if $f \in L_{\mathcal{F}} \setminus \mathcal{F}$ ($f$ is
a negative literal).

The following property of the function $l(f)$ can be easily obtained.

Lemma 2. If $f \in \mathcal{F}$, then $l(\neg f) = \neg l(f)$.

The function $l(f)$ is intended to produce extra literals, corresponding to
fluent literals in $L_{\mathcal{F}}$. We will call a literal $l(f)$ a justifier literal, and will use
the abbreviation $\mathring{f}$ instead of $l(f)$ for simplicity. In addition, the set $\mathring{\mathcal{F}}$ will be
referred to as the set of all justifier fluents, and $\mathring{L}_{\mathcal{F}}$ will be referred to as the
set of all justifier literals.

Having defined the function $l$, we can define a justifier set $\mathring{J}$ for a set of fluent
literals $J$ as $\mathring{J} = \bigcup_{f \in J} \{l(f)\} = \bigcup_{f \in J} \{\mathring{f}\}$.

Definition 5. Given a set of fluents $\mathcal{F}$, a hyper-state is a maximal consistent
set of literals from $L_{\mathcal{F}} \cup \mathring{L}_{\mathcal{F}}$.

We will denote the set of all hyper-states as $\Omega$, where the dimension of $\Omega$ is
$2m$, $m$ being the dimension of $W$. The following two functions map hyper-space
$\Omega$ to normal space $W$ and vice versa.

Definition 6. A projection from $\Omega$ to $W$, $p : \Omega \to W$, is the function mapping
a hyper-state $s = \{f_1, \mathring{f}_1, \ldots, f_n, \mathring{f}_n\} \in \Omega$ to a state $r = \{f_1, \ldots, f_n\} \in W$.

We denote the hyper-part of a hyper-state $s \in \Omega$ as $h(s) = s \setminus p(s)$. Clearly,
for any $s \in \Omega$, $h(s) \cap \mathcal{F} = \emptyset$.

Definition 7. A hyper-neighbourhood of a state $r \in W$, $N : W \to 2^{\Omega}$, is the
function mapping a state $r$ to a set of hyper-states: $N(r) = \{s \in \Omega : r = p(s)\}$.

Clearly, there are $2^m$ states in any hyper-neighbourhood, as there are $m$
justifier fluent names in any hyper-state allowed to vary across the neighbourhood.
Intuitively, justifier literals represent explicit causes for a state $r \in W$. In other
words, the set $N(r)$ is the set of states where all possible causes vary, while the
(proper) literals defined on $\mathcal{F}$ are fixed. For example, given the state $r = \{a, b\}$
in the normal space $W$, one can consider its hyper-neighbourhood $N(r)$ containing
hyper-states $\{a, b, \mathring{a}, \mathring{b}\}$, $\{a, b, \mathring{a}, \neg\mathring{b}\}$, $\{a, b, \neg\mathring{a}, \mathring{b}\}$ and $\{a, b, \neg\mathring{a}, \neg\mathring{b}\}$, where the
justifier fluents $\mathring{a}$ and $\mathring{b}$ vary. Hence any subset of $N(r)$ may represent a particular
causal context: the set $\{\{a, b, \mathring{a}, \mathring{b}\}, \{a, b, \mathring{a}, \neg\mathring{b}\}\}$, for instance, may correspond
to a partial state $\{a, b, \mathring{a}\}$, justifying the literal $a \in r$, and leaving the literal
$b \in r$ somewhat unsupported (more precisely, any change in a truth value of a
literal will be expected to have a justification).
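The hyper-neighbourhood of a state can be enumerated directly, which makes the count of $2^m$ hyper-states easy to check. This is a minimal sketch: marking a justifier literal with a trailing `°` (so `"-b°"` stands for the negated justifier of b) is an encoding chosen here for illustration, and the function names are hypothetical.

```python
from itertools import product


def name_of(lit):
    """Fluent name underlying a literal ("-a" -> "a")."""
    return lit[1:] if lit.startswith("-") else lit


def hyper_neighbourhood(r):
    """N(r): fix the proper literals of r and vary every justifier fluent."""
    names = sorted(name_of(l) for l in r)
    return {
        frozenset(r) | frozenset(sign + n + "°" for sign, n in zip(signs, names))
        for signs in product(("", "-"), repeat=len(names))
    }


def project(s):
    """p(s): the proper (non-justifier) literals of a hyper-state."""
    return frozenset(l for l in s if not l.endswith("°"))


def hyper_part(s):
    """h(s) = s \\ p(s): the justifier literals of a hyper-state."""
    return s - project(s)


hood = hyper_neighbourhood({"a", "b"})
```

For the state {a, b} this yields the four hyper-states listed in the text, each projecting back to {a, b}.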
It is worth noting that the history component $E$ in any causally propagated
pair $(s, E)$ cannot have more elements than $m$, due to the consistency of the
update defined in Definition 2, as shown by Lemma 1. In a simple case, when
the history component $E$ in a pair $(s, E)$ has exactly $m$ elements (or, in other
words, $E = s$ by Lemma 1), the pair can be easily represented by a single
hyper-state $s \cup \mathring{s}$. For instance, the hyper-state $\{a, b, \mathring{a}, \mathring{b}\}$ can account for a causal
transition pair $(\{a, b\}, \{a, b\})$. In the case when the component $E$ has strictly
fewer elements, $E \subset s$, the incompleteness may be represented by a partial
hyper-state. A union of complete hyper-states, $\{\{a, b, \mathring{a}, \mathring{b}\}, \{a, b, \mathring{a}, \neg\mathring{b}\}\}$, can represent
the pair $(\{a, b\}, \{a\})$ in a causal propagation chain where the second component
carries the history of change $\{a\}$.

It is precisely the combinatorial variability of possible causes in a
hyper-neighbourhood that allows us to account for different action-dependent histories
in a causally propagated chain, leading to a successor state in $Res_T(a, w)$.
Before we formally introduce the required notion of a binary causal relation on
hyper-states, let us illustrate the intention with an example.

Consider an action system with $\mathcal{F} = \{a, b, c\}$, $\mathcal{D} = \{\neg b \rightarrow \neg a\}$,
$\mathcal{R} = \{\neg b$ causes $\neg a$ if $\top\}$, and $\mathcal{L} = \{\langle \{b\}, x, \{\neg b\} \rangle\}$. Let us perform the action $x$ at the
initial state $w = \{a, b, c\}$. The action direct effect, stored in an (initial) history
component, is $\{\neg b\}$, and the intermediate state is, obviously,
$\{a, \neg b, c\} = (w \setminus \{b\}) \cup \{\neg b\}$. This state violates the domain constraint, but the only causal
law of the system is applicable: $(\{a, \neg b, c\}, \{\neg b\}) \leadsto (\{\neg a, \neg b, c\}, \{\neg a, \neg b\})$. The
state component of the yielded pair satisfies the domain constraint and therefore
belongs to $Res_T(x, w)$. It is easy to verify that $Res_T(x, w)$ is a singleton.

Now, let us sketch how this simple propagation could be traced in the
hyper-space. The hyper-neighbourhood $N(r)$ of the intermediate state $r = \{a, \neg b, c\}$
contains eight hyper-states, some of which represent the initial history component
$\{\neg b\}$; these hyper-states are precisely the states in $N(r) \cap [\neg\mathring{b}]$. The
hyper-neighbourhood of the successor state $r' = \{\neg a, \neg b, c\}$ contains some hyper-states
accountable for the final history component $\{\neg a, \neg b\}$. These states are precisely
the states in $N(r') \cap [\neg\mathring{a} \wedge \neg\mathring{b}]$ or $N(r') \cap [\neg\mathring{a}] \cap [\neg\mathring{b}]$. Our intention, therefore, is
to construct, for an action system, such a binary relation on hyper-states that a
transition in hyper-space faithfully corresponds to causal propagation driven by
an action-dependent history.

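Formally, we define a binary relation on states in $\Omega$ as follows.

Definition 8. A binary relation $C$ is defined on $\Omega \times \Omega$. We say that $C(s, s')$ if
and only if there exists a causal relationship $\varepsilon$ causes $\varrho$ if $\Phi$ such that

1. $p(s) \vdash \varepsilon \wedge \Phi \wedge \neg\varrho$

2. $h(s) \vdash \mathring{\varepsilon}$

3. $p(s') = (p(s) \setminus \{\neg\varrho\}) \cup \{\varrho\}$

4. $h(s') = (h(s) \setminus \{\neg\mathring{\varrho}\}) \cup \{\mathring{\varrho}\}$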
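To see how the four conditions of Definition 8 generate links, the sketch below computes, for each hyper-state in the neighbourhood of {a, ¬b, c}, the hyper-states it is C-related to under the relationship "¬b causes ¬a if ⊤". The encoding (strings, `-` for negation, a trailing `°` for justifier literals, formulas as predicates) and all function names are assumptions made for illustration.

```python
from itertools import product


def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit


def ring(lit):
    """Justifier literal l(lit) in the assumed encoding ("-a" -> "-a°")."""
    return lit + "°"


def p(s):
    return frozenset(l for l in s if not l.endswith("°"))


def h(s):
    return frozenset(l for l in s if l.endswith("°"))


def c_links(s, relationships):
    """All s' with C(s, s'), following conditions 1-4 of Definition 8."""
    targets = set()
    state, hyper = p(s), h(s)
    for cause, rho, phi in relationships:
        if (cause in state and phi(state) and neg(rho) in state  # condition 1
                and ring(cause) in hyper):                        # condition 2
            new_state = (state - {neg(rho)}) | {rho}              # condition 3
            new_hyper = (hyper - {ring(neg(rho))}) | {ring(rho)}  # condition 4
            targets.add(new_state | new_hyper)
    return targets


relationships = [("-b", "-a", lambda s: True)]
r = frozenset({"a", "-b", "c"})
hood = {r | frozenset(sign + n + "°" for sign, n in zip(signs, "abc"))
        for signs in product(("", "-"), repeat=3)}
links = {s: c_links(s, relationships) for s in hood}
sources = {s for s, t in links.items() if t}
targets = set().union(*links.values())
```

In this run, four of the eight hyper-states in the neighbourhood have outgoing links (exactly those containing the negated justifier of b), and every target contains the negated justifiers of both a and b.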
Figure 1 illustrates the generation of links by a causal relationship between
hyper-states which belong to distinct hyper-neighbourhoods (the causal relationship
is the same as in the example above).

Fig. 1. The C-links between hyper-neighbourhoods of the states $\{a, \neg b, c\}$ and
$\{\neg a, \neg b, c\}$, generated by a causal relationship $\neg b$ causes $\neg a$ if $\top$.

The fact that all the states in $N(r) \cap [\neg\mathring{b}]$ have links to the states in
$N(r') \cap [\neg\mathring{a} \wedge \neg\mathring{b}]$ is not a coincidence, and will be formally captured in a definition of a
successor state.

It is worth pointing out that the first condition in Definition 8 requires that
the literal $\varepsilon$ is a part of the $p(s)$ state, unlike Definition 2. However, Lemma 1
illustrated that this requirement is implicit in Definition 2 as well, ensuring that,
in this respect, the new definition is not going to be more restrictive than the
former one (formally, it will be shown later).

A causal relationship $\varepsilon$ causes $\varrho$ if $\Phi$ may, upon translation, generate quite
a few links between hyper-states. But causal propagation expressed in terms of
these links is much clearer and simpler than that of Thielscher’s approach.

4 Representation Theorems 

The following set will help in our analysis of causal links $C(s, s')$. Given two states
$x \in W$ and $y \in W$, the set $L(x, y) = \{s \in N(x) : C(s, s')$, for some $s' \in N(y)\}$
will be referred to as the connection set for the states $x$ and $y$. In general,
$L(x, y) \neq L(y, x)$.

An important property of the relation $C$ is that there are at least $2^{m-1}$ links
generated by one causal relationship (the minimum is attained when a causal
relationship $\varepsilon$ causes $\varrho$ if $\Phi$ is qualified by a complete state: $\Phi = \bigwedge_k f_k$). This
property leads to the following lemma.

Lemma 3. For any two states $x \in W$ and $y \in W$, if the connection set $L(x, y) \neq \emptyset$
then there exists a justifier literal $\mathring{f}$ such that $[\mathring{f}] \cap N(x) \subseteq L(x, y)$.

This lemma basically says that, if there is at least one C-link between two
hyper-states, then there are at least $2^{m-1} - 1$ more C-links between hyper-states
in the respective neighbourhoods, and all these links are generated by the same
causal relationship.

Figure 2 illustrates the existence of a justifier literal $\neg\mathring{b}$ such that
$[\neg\mathring{b}] \cap N(\{a, \neg b, c\}) \subseteq L(\{a, \neg b, c\}, \{\neg a, \neg b, c\})$.

Fig. 2. All $[\neg\mathring{b}]$-states belong to the connection set $L(\{a, \neg b, c\}, \{\neg a, \neg b, c\})$.

It is possible to show that a qualified reverse observation holds as well.

Lemma 4. For any two states $x \in W$ and $y \in W$, if there exists a justifier
literal $\mathring{f}$ such that $[\mathring{f}] \cap N(x) \subseteq L(x, y)$, then there exists a causal relationship $f$
causes $\varrho$ if $\Phi$, for some $\Phi$ and $\varrho$, where $\{\varrho\} = y \setminus x$.

The proof of this lemma progressively eliminates all literals except $f$, which
might have been alternative causes. It capitalises on the fact that varying $m - 1$
justifier literals (having fixed $\mathring{f}$) accounts for at most $2^{m-1} - 1$ states in a
hyper-neighbourhood, while there are $2^{m-1}$ states in the set $[\mathring{f}] \cap N(x)$.

Together, the last two lemmas show that the presence of a causal relationship
underlying a C-link is equivalent to the existence of a justifier literal $\mathring{f}$ such that
$[\mathring{f}] \cap N(x) \subseteq L(x, y)$.

Corollary 1. For any two states $x \in W$ and $y \in W$, there exists a justifier
literal $\mathring{f}$ such that $[\mathring{f}] \cap N(x) \subseteq L(x, y)$, if and only if there exists a causal
relationship $f$ causes $\varrho$ if $\Phi$, where $\{\varrho\} = y \setminus x$, for some $\Phi$.

It is not surprising to observe that a connection set cannot contain both all
$[\mathring{\varepsilon}]$-states and all $[\neg\mathring{\varepsilon}]$-states of a hyper-neighbourhood. Although the set $\mathcal{R}$ of
causal relationships is allowed to include causal relationships like $f$ causes $\varrho$
if $\Phi$ and $\neg f$ causes $\varrho$ if $\Phi$, such (“contradictory”) relationships would generate
C-links originating from different hyper-neighbourhoods. So any given
hyper-neighbourhood may have outgoing C-links generated by only one of the
“contradictory” relationships. Formally, this observation is captured as follows.

Lemma 5. For any two states $x \in W$ and $y \in W$, there is no justifier literal $\mathring{\varepsilon}$
such that both $[\mathring{\varepsilon}] \cap N(x) \subseteq L(x, y)$ and $[\neg\mathring{\varepsilon}] \cap N(x) \subseteq L(x, y)$ hold.

Before we define a successor state for an initial state $w \in W$ and an action
$a$, where $(C, a, E)$ is the action law, we need to define one more construct:
a trigger set of hyper-states $s \in H$, where the $p(s)$ state is the nearest
state to $w$ consistent with the direct effects $E$, and the justifier literals
in $h(s)$ capture the initial (immediate) causal context.

Definition 9. A trigger set of states $\|E\|_w$ is defined for an initial state
$w \in W$ and an action $a$, where $(C, a, E)$ is the action law, as

$$\|E\|_w = \{ s \in N(q) : q \in W,\ q \in \min([E], \preceq_w),\ h(s) \vdash E \}$$

where $x \preceq_w y$ if and only if $Diff(x, w) \subseteq Diff(y, w)$.

Here $Diff(p, q)$ represents the symmetric difference of $p$ and $q$ (i.e.,
$(p \setminus q) \cup (q \setminus p)$) as in PMA [14].
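The PMA-style minimisation used in Definition 9 can be illustrated with a small sketch. It only covers the selection of the nearest states $q \in \min([E], \preceq_w)$, not the hyper-states or justifier literals. The representation of states as dictionaries from fluent names to truth values, and the helper names `all_states`, `diff` and `nearest_states`, are assumptions made for this illustration; comparing sets of disagreeing fluents is equivalent to comparing symmetric differences of literal sets.

```python
from itertools import product

def all_states(fluents):
    """Enumerate every state as a dict mapping fluent name -> truth value."""
    for values in product([True, False], repeat=len(fluents)):
        yield dict(zip(fluents, values))

def diff(x, w):
    """Fluents on which states x and w disagree."""
    return {f for f in x if x[f] != w[f]}

def nearest_states(w, effects):
    """States satisfying `effects` whose difference from w is inclusion-minimal."""
    fluents = sorted(w)
    candidates = [s for s in all_states(fluents)
                  if all(s[f] == v for f, v in effects.items())]
    return [q for q in candidates
            if not any(diff(p, w) < diff(q, w) for p in candidates)]

# Initial state {a, b, c} with direct effect {not b}.
w = {"a": True, "b": True, "c": True}
print(nearest_states(w, {"b": False}))
```

For the initial state {a, b, c} and direct effect ¬b, the only minimal candidate is the state {a, ¬b, c}, which matches the hyper-neighbourhood used in the running example.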

In other words, $\|E\|_w$ is the set contained in the hyper-neighbourhood $N(q)$
of the state $q$ nearest to the initial state $w$ (in terms of the PMA ordering),
and the states $s \in \|E\|_w$ jointly represent the initial causally justified
changes triggered by effects $E$. For example, consider an action law
$(C, a, \{\neg b\})$ applied at the initial state $\{a, b, c\}$. Then the trigger
set $\|\{\neg b\}\|_{\{a,b,c\}}$ contains exactly the states placed in boxes in
Figure 2. The following observation can be obtained from the definition
immediately.
Lemma 6. For any initial state $w \in W$ and an action $a$, where $(C, a, E)$ is
the action law, $\bigcap_{s \in \|E\|_w} h(s) = \mathring{E}$.

Intuitively, what the states $s \in \|E\|_w$ have in common, in terms of
justifier literals, is precisely the literals in $\mathring{E}$.

Having defined a trigger set $\|E\|_w$, we can formally trace a causal
propagation in the hyper-space $\Omega$. Let $C^*$ be the transitive closure of $C$.

Definition 10. We say that a hyper-neighbourhood $N(q)$, where $q \in W$, is
causally triggered by the set $\|E\|_w$, denoted as $\|E\|_w \succ N(q)$, if and
only if $\forall s \in \|E\|_w$, $\exists s' \in N(q)$ such that $C^*(s, s')$ holds.
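The triggering condition of Definition 10 amounts to a reachability check over C-links. The following is a minimal sketch under assumed data structures: hyper-states are plain hashable labels, `links` is a hypothetical adjacency dictionary standing in for the relation $C$, and the function names are illustrative only.

```python
def reachable(s, links):
    """Hyper-states s' with C*(s, s'): reachable from s by one or more C-links."""
    seen = set()
    stack = list(links.get(s, ()))
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.add(nxt)
            stack.extend(links.get(nxt, ()))
    return seen

def causally_triggers(trigger_set, neighbourhood, links):
    """True if every state of the trigger set reaches some state of the neighbourhood."""
    target = set(neighbourhood)
    return all(reachable(s, links) & target for s in trigger_set)

# Toy C-links: s1 -> u1 directly, s2 -> m -> u2 through an intermediate state.
links = {"s1": ["u1"], "s2": ["m"], "m": ["u2"]}
print(causally_triggers(["s1", "s2"], ["u1", "u2"], links))
```

Note that, as in the definition, a single trigger state without a C-path into the neighbourhood is enough for triggering to fail, which is the situation described for Figure 3.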

A case shown previously in Figure 2 was an instance (assuming a direct
action effect $\neg b$) when the trigger set $\|\{\neg b\}\|_{\{a,b,c\}}$ does
trigger the hyper-neighbourhood on the right-hand side. Figure 3 gives an
example when the same trigger set fails to trigger the same
hyper-neighbourhood, because not all the states in the set
$\|\{\neg b\}\|_{\{a,b,c\}}$ belong to the given connection set.




Fig. 3. The C-links between hyper-neighbourhoods of the states $\{a, \neg b, c\}$
and $\{\neg a, \neg b, c\}$, generated by a causal relationship $c$ causes
$\neg a$ if $\neg b$. Some $[\neg b]$-states do not belong to the connection set
$L(\{a, \neg b, c\}, \{\neg a, \neg b, c\})$.



It is easy to check that the causal relationship which generated the connection
set would not be applicable according to Thielscher's approach either, because
the cause ($c$) is not a part of a history component (equal to the direct effect
at this stage).

Intuitively, changes triggered by the set $\|E\|_w$ propagate in hyper-space
towards a hyper-neighbourhood of a possible successor state, tracing through
some (causally triggered) hyper-neighbourhoods.

Definition 11. A hyper-state $s$ is final if and only if
$\{s' : C(s, s')\} = \emptyset$. A state $r \in W$ is final if and only if
$\forall s \in N(r)$, $s$ is final.

Now we are ready to formally define a set of possible successor states
$Res_\Omega(a, w)$ intended to faithfully capture Thielscher's resultant state
set $Res_T(a, w)$.

Definition 12. Let $\mathcal{F}$ be a set of fluent names, $\mathcal{A}$ a set of
action names, $\mathcal{L}$ a set of action laws, $C$ a causal binary relation
defined by Definition 8. Furthermore, let $w \in W$ be an initial state and let
$a \in \mathcal{A}$ be an action name. A state $r \in W$ is a successor state of
$w$ and $a$, $r \in Res_\Omega(a, w)$, if and only if there exists an applicable
(with respect to $w$) action law $\alpha = (C, a, E) \in \mathcal{L}$ such that
$\|E\|_w \succ N(r)$ and $r$ is final.

Alternatively, $Res_\Omega(a, w) = \{r \in W : \|E\|_w \succ N(r),\ r \text{ is final}\}$.

We will need the following lemma before establishing the desired representa- 
tion result. 

Lemma 7. If $\|E\|_w \subseteq N(x)$, then $\|E\|_w \succ N(y)$ for some
$y \in W$ iff $(x, E) \rightsquigarrow^{*} (y, E')$ for some $E'$.

This lemma establishes a principal parallel between propagation in the
hyper-space and causal propagation in Thielscher's approach.

The foregoing results now allow us to establish the central result of this paper.



Theorem 1. $Res_T(a, w) = Res_\Omega(a, w)$.

Analogous results can be obtained for conservative successor states as well if
we define $Res^{*}_\Omega(a, w) = \{r \in W : \|E\|_w \succ N(r),\ r \text{ is final},\ E \subseteq r\}$.

Theorem 2. $Res^{*}_T(a, w) = Res^{*}_\Omega(a, w)$.

5 Discussion 

The semantics proposed here extends the standard state-space to a hyper-space, 
and works by tracing the effects of actions (including indirect effects) in the 
hyper-space. The hyper-states are used to supply extra context to the process 
of causal propagation. 

These additional states are reminiscent of a semantics provided by
Kraus et al. [5] for nonmonotonic consequence relations. It would be interesting
to extract a nonmonotonic consequence relation from the result function(s)
investigated here. This would facilitate a comparison with a wider class of
logics for nonmonotonic reasoning.

While we do not prove that the strategy proposed here is capable of furnishing 
a semantics for approaches to reasoning about action in general, we suggest that 
it is a fruitful strategy to pursue in supplying a unifying semantics for a large 
class of such frameworks. This is, at present, the subject of ongoing investigation. 

There are several questions that one is tempted to ask. Are hyper-states 
necessary? What is the minimum number of hyper-states required to characterise 
a given result function? These, too, will be left for future investigation. 
Abstract. Dynamic Belief Networks (DBNs) have become a popular
method for monitoring dynamical processes in real-time. However, DBN
evaluation has the same problems of computational intractability as
ordinary belief networks, with additional exponential complexity as the
number of time-slices increases. Several approximate methods for fast
DBN evaluation have been devised [1,3,11]. We present a new method
which simplifies evaluation by selectively "forgetting" past events and
their relationships to the present. This is done by pruning, from past
time-slices, arcs and nodes which are deemed less relevant to the current
time-slice, as determined by the arc weight measure introduced in [15].
This approach is more flexible than a fixed-size window and can be
combined with other approximate evaluation techniques.



1 Introduction 

Dynamic Belief Networks (DBNs) [5,14,12] extend the basic framework of
belief networks [16] (BNs) by introducing a temporal aspect into the graphical
representation. The building block is a traditional BN, where variable states
change in accordance with the dependencies encoded in the structure. This basic
unit is repeated in a series, each instance corresponding to a specific time-slice.
Arcs connecting nodes across time-slices represent the dynamic behaviour of the
domain and indicate the probabilistic dependence of a variable upon past states.

Exact and approximate methods developed for ordinary BN evaluation can 
also be used for DBN evaluation, with the same problem that undirected loops 
and large state-spaces lead to computational intractability. The temporal aspect 
of DBNs is an additional complicating factor, because as time-slices are added to 
the model, the total network size grows and evaluation complexity increases ex- 
ponentially. DBNs have become a popular method for real-time applications such 
as traffic monitoring [8], medical monitoring [3] and process control [9], where it 
is crucial for the model to provide an acceptable response in real-time. Several 
approximate methods for fast evaluation of DBNs have been devised [1,3,11] (see 
Section 2). In this paper, we present a new approximate method which simplifies 
evaluation by selectively “forgetting” past events and their relationships to the 
present. This is done by pruning regions of past time-slices which are deemed to
have less potential impact on variables of the current time-slice, as determined
by the arc weight measure. The pruning procedure creates a "tail" behind the
current time-slice; the shape and size of the tail can be adjusted using a
threshold parameter. This method offers more flexibility than a fixed-size
window and can be combined with other evaluation techniques.

To select arcs for removal, we use the arc weight measure based on mutual 
information introduced in [15] (Section 3). Our algorithm is described in Sec- 
tion 4, together with an example of its application. Empirical results showing the 
performance of the algorithm for two example networks are given in Section 5. 
In particular we look at the trade-off between the error and the computational 
complexity compared to fixed-window evaluation, and for different threshold val- 
ues. We conclude by indicating other approximate evaluation methods to which 
the arc weight measure may be applied. 



2 Related Work 

Algorithms developed for approximate evaluation of ordinary BNs such as [7,4] 
can also be applied to DBN evaluation and can be used in conjunction with 
methods which limit the size of the DBN by deleting arcs and/or nodes. The 
most common method of reducing DBN size is to maintain a “window” of a fixed 
number of time-slices [6], with a time-slice pruned off the past as each new time- 
slice is added. Using a larger window than 2 time-slices provides more accuracy 
at the expense of computation complexity. The disadvantage of any fixed window 
is that past slices are pruned completely, so that some dependencies, which may 
be important in the current context, are ignored. In this paper we provide a 
more selective pruning method. 

The basic idea behind our approach is similar to that of Kjaerulff's [13], that
is, reducing computation complexity by the removal of weak dependencies. While
our method deals directly with the network itself, Kjaerulff’s method involves the 
removal of arcs from the moralised independence graph. Also, his method does 
not directly exploit the structure of a DBN, although results are shown for the 
same WATER DBN that we investigate in Section 5.2. After simplifying the DBN 
structure using our algorithm, beliefs can be updated by any standard exact 
or approximate evaluation technique; however, the modified structure obtained 
by Kjaerulff’s procedure can only be used in conjunction with the Junction-tree 
evaluation algorithm. Hence, our method is more flexible. 

Dagum and Galper's forecasting algorithm [3] also uses selective arc deletion
to simplify computation while providing approximate beliefs. It performs a k-step
ahead forecast by removing arcs to render every uninstantiated node a root node
in earlier time-slices, with priors equal to the posterior forecast distribution of
the node. Instead of doing this over one large DBN, they use a series of smaller
sections. While this method is efficient if many nodes are instantiated, given a
complex DBN where only a few nodes have evidence in each time-slice, few
arcs will be deleted and the reduction in computational complexity is not
significant. In contrast to both Kjaerulff's, and Dagum and Galper's methods,
our method uses a threshold parameter which can be adjusted to control arc
removal to provide the required computation performance.

Boyen and Koller [1] use a 2 time-slice window for their evaluation method,
which is based on the assumption that the current belief state summarises all his- 
torical information about the system. For each new slice, the window is “rolled” 
by copying values from the later slice to the earlier, and recomputing. They show 
that if a compact and approximate belief state is maintained over the current 
time, the error does not increase into the future, but remains bounded and even 
contracts. When testing our algorithm our experimental results also show the 
error is bounded over time. Their compact state is built by assuming the process 
can be decomposed into a number of weakly interacting subprocesses. They do 
not provide a method for automatically identifying good subprocesses; in [10] 
we show that our arc weight measure could be used for this purpose. 



3 Arc Weight Based on Mutual Information 



In this section we review the definition of an arc weight based on Mutual
Information presented in [15]. Mutual Information (MI) [18,16] is a measure of the
dependence between two random variables. It is the reduction in uncertainty
of X due to knowing Y, and vice-versa. Since $p(X,Y) = p(X)\,p(Y \mid X)$, the MI
between two variables X and Y can be written as:

$$I(X,Y) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y)} \qquad (1)$$

MI is symmetric, i.e. $I(X,Y) = I(Y,X)$. It is a non-negative quantity and is
zero if and only if X and Y are mutually independent.

Given a node Y with single parent X in a BN, the MI between X and Y
describes the influence of X on Y and vice-versa. The arc weight of $X \to Y$ is
computed as the MI between X and Y:

$$w(X,Y) = \sum_{i \in \Omega(X)} P_{pr}(X = i) \sum_{j \in \Omega(Y)} p(Y = j \mid X = i) \log \frac{p(Y = j \mid X = i)}{P_{pr}(Y = j)} \qquad (2)$$



where $\Omega(X)$ denotes the state space of X and $P_{pr}(X = i)$ denotes the prior
probability of X being in state i. If X is a root node, its priors are stored in the
BN. Otherwise $P_{pr}(X = i)$ is approximated by averaging the conditional
probabilities of X over all parent state combinations. Similarly, because
$P_{pr}(Y = j)$ is not directly available in the BN, it is approximated by averaging
the conditional probabilities of Y over all states of X.
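As a rough illustration of Eq. 2 and of the averaging approximation just described, the sketch below computes the weight of an arc X -> Y from a prior over X and a conditional probability table. The list-of-rows CPT layout, the function names and the use of the natural logarithm are assumptions of this sketch; the text does not fix a logarithm base.

```python
import math

def averaged_prior(cpt):
    """Approximate prior of the child: unweighted mean of the CPT rows
    (one row per parent state), as described for P_pr(Y = j)."""
    n_rows = len(cpt)
    return [sum(row[j] for row in cpt) / n_rows for j in range(len(cpt[0]))]

def arc_weight(prior_x, cpt_y_given_x):
    """Weight of arc X -> Y in the style of Eq. 2 for a single-parent node Y."""
    prior_y = averaged_prior(cpt_y_given_x)
    weight = 0.0
    for i, p_x in enumerate(prior_x):
        for j, p_yx in enumerate(cpt_y_given_x[i]):
            if p_yx > 0:  # terms with zero probability contribute nothing
                weight += p_x * p_yx * math.log(p_yx / prior_y[j])
    return weight

# Y copies X exactly: the arc carries log(2) nats of information.
print(arc_weight([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
# Y ignores X: the arc weight is zero.
print(arc_weight([0.5, 0.5], [[0.2, 0.8], [0.2, 0.8]]))
```

The two calls cover the extreme cases: a deterministic dependence gives the largest weight for a binary node, and identical CPT rows give a weight of zero, so such an arc would be among the first candidates for removal.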

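Given a node Y with parent X and a set of other parents $\mathbf{Z} = \{Z_0, \ldots, Z_n\}$,
we defined the weight of arc $X \to Y$ as:

$$w(X,Y) = \sum_{k \in \Omega(\mathbf{Z})} P_{pr}(\mathbf{Z} = k) \sum_{i \in \Omega(X)} P_{pr}(X = i) \sum_{j \in \Omega(Y)} p(Y = j \mid X = i, \mathbf{Z} = k) \log \frac{p(Y = j \mid X = i, \mathbf{Z} = k)}{P_{pr}(Y = j \mid \mathbf{Z} = k)} \qquad (3)$$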
$P_{pr}(X = i)$ is obtained as for Eq. 2. For each state of Y, $P_{pr}(Y = j \mid \mathbf{Z} = k)$
is approximated by averaging the conditional probabilities of Y over all state
combinations of parents in $\mathbf{Z}$. To estimate $P_{pr}(\mathbf{Z} = k)$, we first calculate $P_{pr}(Z)$
for every $Z \in \mathbf{Z}$, then multiply out the joint probabilities for each combination
of states, i.e. $P_{pr}(Z_0 = l_0, \ldots, Z_n = l_n) = \prod_{m=0 \ldots n} P_{pr}(Z_m = l_m)$. This is
equivalent to assuming all $Z \in \mathbf{Z}$ are independent.

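We also defined a combined arc weight of multiple converging arcs on a
node, estimating the combined effect of all parent nodes. Consider a node Y
with parents X and Z: $X \to Y \leftarrow Z$. In this configuration, the combined weight
of the converging arcs to child node Y from its parents X and Z is:

$$W(Y, \{X, Z\}) = \sum_{i \in \Omega(X)} P_{pr}(X = i) \sum_{k \in \Omega(Z)} P_{pr}(Z = k) \sum_{j \in \Omega(Y)} p(Y = j \mid X = i, Z = k) \log \frac{p(Y = j \mid X = i, Z = k)}{P_{pr}(Y = j)} \qquad (4)$$

$P_{pr}(X)$, $P_{pr}(Z)$ and $P_{pr}(Y)$ are obtained using the same averaging procedure
used for the single parent situation. In general, given node Y with a set of
parents $\mathbf{Z}$, the formula is:

$$W(Y, \mathbf{Z}) = \sum_{k \in \Omega(\mathbf{Z})} P_{pr}(\mathbf{Z} = k) \sum_{j \in \Omega(Y)} p(Y = j \mid \mathbf{Z} = k) \log \frac{p(Y = j \mid \mathbf{Z} = k)}{P_{pr}(Y = j)} \qquad (5)$$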
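The combined weight of Eq. 5 can be sketched in the same spirit. The joint prior of the parents is formed as a product of per-parent priors, which mirrors the independence assumption stated above. The dictionary-based CPT keyed by tuples of parent state indices, the function name and the natural logarithm are assumptions of this illustration.

```python
import math
from itertools import product

def combined_weight(parent_priors, cpt):
    """Combined weight of all arcs converging on Y, in the style of Eq. 5.

    parent_priors: one prior distribution (list) per parent.
    cpt: dict mapping a tuple of parent state indices to the distribution of Y.
    """
    combos = list(product(*[range(len(p)) for p in parent_priors]))
    n_y = len(cpt[combos[0]])
    # Approximate P_pr(Y = j) by averaging the CPT over all parent combinations.
    prior_y = [sum(cpt[k][j] for k in combos) / len(combos) for j in range(n_y)]
    total = 0.0
    for k in combos:
        # Joint prior of the parent configuration under the independence assumption.
        p_k = math.prod(parent_priors[m][s] for m, s in enumerate(k))
        for j in range(n_y):
            if cpt[k][j] > 0:
                total += p_k * cpt[k][j] * math.log(cpt[k][j] / prior_y[j])
    return total

# Y is the exclusive-or of two uniformly distributed binary parents.
xor_cpt = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [0.0, 1.0], (1, 1): [1.0, 0.0]}
print(combined_weight([[0.5, 0.5], [0.5, 0.5]], xor_cpt))
```

The exclusive-or example is a case where the parents are only informative jointly, which is one motivation for scoring the whole set of converging arcs rather than each arc in isolation.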



4 The Algorithm 

Our algorithm for simplifying DBN evaluation using arc weights, called DBN-AW,
is shown in Figure 1. Initially, the DBN contains two slices, $TS_0$ and $TS_1$, and
a parameter threshold is supplied as input to the algorithm. Evidence is
entered into the current time-slice, the DBN is extended a time step and beliefs
are updated.¹ Since the beliefs for nodes in the current time-slice summarise
the historical information about the system, we store them for later use (Step
5(b)(ii)) as the node's priors if its incoming arcs are deleted.

Step 5 goes back through earlier time-slices, removing selected sets of
converging arcs. For each node N, the weight W(N, parents(N)) of incoming arcs
into N (see Section 3) is divided by the variable stepBack, which represents the
distance in terms of time-slices from node N to the current slice. Since influence
generally decreases with distance [17], dividing in this way provides a
heuristic estimate of the influence of deleted arcs on the current queries. Note that
when the CPDs are invariant in time (a common assumption for many
applications) the arc weights need not be computed for each new time-slice, but can be
pre-computed off-line.

The threshold parameter indirectly determines a tail size and shape. As time 
advances, arcs and nodes are removed from past slices, and eventually, an en- 
tire time-slice is deleted. When arcs are deleted, some past dependencies are 



¹ As mentioned in Section 2, any standard evaluation algorithm may be used.
1. t = 0.

2. Enter evidence into $TS_t$.

3. For each node $N \in TS_t$

   (a) update and return bel(N)

   (b) store a copy of bel(N) for use in step 5(b)ii

4. If t < 2, go to step 7.

5. For i = t to 1 do

   (a) stepBack = t − i + 1

   (b) For each node $N \in TS_{stepBack}$, if N is not a root node and
       (W(N, parents(N)) / stepBack) < threshold

       i. remove all incoming arcs into N

       ii. assign priors of the new root node N from the stored
           beliefs (Step 3(b))

   (c) stepBack = stepBack − 1

6. Delete all disconnected nodes.

7. Add a time-slice $TS_{t+1}$ to the DBN.

8. t = t + 1

9. Go to step 2.

Fig. 1. Algorithm DBN-AW. 
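The pruning part of Figure 1 (steps 5 and 6) can be sketched on a toy data structure. This is an illustrative reading rather than the authors' implementation: the nested-dictionary DBN layout, the field names `parents`, `W` and `prior`, and the function name are hypothetical, and the sketch assumes that a node at distance stepBack from the current slice t lives in slice t − stepBack, following the description of stepBack in the text.

```python
def prune_past_slices(dbn, t, threshold, stored_beliefs):
    """One pruning pass over slices 0..t-1 of a toy DBN.

    dbn: dict slice_index -> dict node_name -> {"parents": [...], "W": float},
         where each parent is a (slice_index, node_name) pair and "W" is the
         combined weight of the node's incoming arcs.
    stored_beliefs: dict (slice_index, node_name) -> belief vector saved earlier.
    """
    for step_back in range(1, t + 1):
        s = t - step_back  # assumed slice at distance step_back from slice t
        for name, node in dbn.get(s, {}).items():
            if node["parents"] and node["W"] / step_back < threshold:
                node["parents"] = []                        # remove all incoming arcs
                node["prior"] = stored_beliefs[(s, name)]   # new root gets stored belief
    # Delete past nodes that are now disconnected (no parents and no children).
    has_child = {p for sl in dbn.values() for n in sl.values() for p in n["parents"]}
    for s in range(t):
        for name in list(dbn.get(s, {})):
            if not dbn[s][name]["parents"] and (s, name) not in has_child:
                del dbn[s][name]
    return dbn

# A single-variable chain A0 -> A1 -> A2 with current slice t = 2.
dbn = {
    0: {"A": {"parents": [], "W": 0.0}},
    1: {"A": {"parents": [(0, "A")], "W": 0.3}},
    2: {"A": {"parents": [(1, "A")], "W": 0.3}},
}
beliefs = {(1, "A"): [0.6, 0.4]}
prune_past_slices(dbn, t=2, threshold=0.4, stored_beliefs=beliefs)
print(dbn)
```

In this example the arc into the slice-1 node falls below the threshold, so that node becomes a root with its stored belief as prior, and the slice-0 node, now disconnected, is deleted. This is the mechanism by which the tail behind the current slice is shortened.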



being "forgotten". Deleting temporal arcs corresponds to forgetting the
influence of past events on more recent events. On the other hand, deleting static
arcs corresponds to forgetting dependencies between events which occurred
simultaneously in the past. Both cases rely on the assumption that past events
and influences can be ignored, at an acceptable cost in accuracy, to simplify the
computation of current beliefs. The forgetting process happens gradually and
preferentially preserves relationships which impact more on the current
time-slice. Note that variants of algorithm DBN-AW are possible; for example, instead
of using W(N, parents(N)) as the measure and removing all incoming arcs into
node N, each arc could be considered individually for removal based on the value
of w(N, parent(N)) for that arc. Results for such a variant are given in [10].

To illustrate the working of the algorithm we use a small example DBN,
called dbn6, with 6 nodes per time-slice, obtained by taking a section of the
water [9] network and preserving some of the original CPDs. Figure 2 shows the
DBN generated by five steps of algorithm DBN-AW with a threshold value of .4
for dbn6. At the fifth step, all the nodes from the $TS_0$ time-slice are deleted; the
shape of the tail remains the same in further steps.

Figure 3 shows dbn6 with different threshold values, which clearly determine
the shape and size of the tail. For the higher threshold value of .6, more arcs are
removed earlier and the nodes in time-slice $TS_0$ are deleted at the fourth step;
this threshold gives a network that is very close to a fixed 4-time-slice window.
The lower threshold value of .2 takes until the eighth step to delete the nodes
of $TS_0$ and the DBN has a longer tail; it is further from a fixed window DBN.
Fig. 2. Sequence of networks constructed for example DBN dbn6 using the
DBN-AW algorithm with threshold=.4. The nodes of the DBN at the first step
are labelled with the state spaces (in brackets) and the combined weight W of
incoming arcs (underlined). Arcs deleted by algorithm DBN-AW are shown dotted
in the third, fourth and fifth steps.



Fig. 3. Example DBN dbn6 produced by algorithm DBN-AW with threshold=.2
(above) and .6 (below).
5 Results

Our results were obtained using the Netica software [2] on a Pentium 2 at 
300 MHz and 64Mb memory. Belief updating (Step 3(a)) is done using stan- 
dard exact Junction-tree evaluation. For the purposes of evaluation, we assume 
that the main purpose is monitoring, i.e. maintaining a belief distribution over 
the current world state, hence the beliefs for the current time-slice TSt are used 
for error estimation. The error is measured in terms of the Kullback-Leibler 
(KL) distance [16] between exact and approximate beliefs. When the algorithm 
is tested with evidence, results are plotted as a graph of the average error over 
50 runs against the time-slice. 
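For reference, the error measure can be sketched as follows. This is a minimal illustration, assuming that the exact belief plays the role of the reference distribution and that the natural logarithm is used; neither choice is stated in the text, and the function name is hypothetical.

```python
import math

def kl_distance(exact, approx):
    """Kullback-Leibler distance KL(exact || approx) between two discrete
    belief vectors over the same states. States with zero exact probability
    contribute nothing; approx is assumed non-zero wherever exact is non-zero."""
    return sum(p * math.log(p / q) for p, q in zip(exact, approx) if p > 0)

# Identical beliefs give zero error; diverging beliefs give a positive error.
print(kl_distance([0.7, 0.3], [0.7, 0.3]))
print(kl_distance([0.7, 0.3], [0.5, 0.5]))
```

An average over runs, as plotted in the figures, would then be the mean of such values for each time-slice.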



5.1 dbn6 DBN: Approximation Error 

Figure 4 (LEFT) shows results from DBN-AW with threshold values of .3, .6 
and .9, a 2 time-slice (2-TS) and a 6 time-slice (6-TS) fixed-window evaluation 
of dbn6 over 50 slices with no evidence. These errors are calculated by comparing 
to the beliefs from a complete 50-TS DBN. The best results are obtained with 
a 6-TS window, indicating that for this example network the state of the world 
at the current time-slice is largely independent of the world state more than 6 
time steps ago. The order of the curves for DBN-AW shows that a larger threshold
results in larger errors, due to more arcs being deleted. The errors are highest
for threshold value .9, smaller with .6 and least with .3.

Figure 4 (RIGHT) shows the results with evidence in each time-slice,
comparing evaluation using 2-TS, 6-TS and DBN-AW with threshold = .5. The order
of the curves shows that 2-TS gives the worst error, DBN-AW with threshold = .5
is better and the best performance is obtained using a 6-TS window. Z-tests on
the averaged errors verified this hypothesis at the 5% significance level (see [10]
for details of z-test results). Notice that the variance of the curves with
evidence added at each step is more drastic than for no evidence; this effect is also
mentioned by Boyen and Koller [1].

In Figure 5 (LEFT), the error curves are shown for evidence in each
time-slice, with the threshold parameter for DBN-AW taking values .2, .4, .6 and .8.
The error increases with increasing values of the threshold parameter, except for
the .6 curve, which has a lower error than .4. The reason for this is that at
threshold=.6 good approximate priors are obtained for nodes whose incoming
arcs are deleted. Although this is not expected, it can happen "by chance".
On the whole, the general trend of the curves indicates an error increase as
threshold increases, which z-tests confirmed was the case for most points at the
5% significance level.

Randomised CPDs. Since arc weights are calculated based on the values in 
the CPDs, for a given DBN structure, the CPDs will determine which arcs are 
deleted from past time-slices. Hence, the errors in current beliefs also indirectly 
depend on the network’s CPDs. To show the general applicability of the algo- 
rithm, we created five different versions of dbn6, with the same topology but 





Fig. 4. LEFT: results from DBN-AW with threshold values .3, .6 and .9, a 2
time-slice, and a 6 time-slice window on example DBN dbn6 over 50 time-slices with
no evidence. RIGHT: results from DBN-AW with threshold value .5, a 2-TS and
a 6-TS window on example DBN dbn6 over 50 time-slices with evidence entered
in each time-slice.



with the CPDs changed randomly. DBN-AW was then tested with the same values
of threshold on the 5 networks, and the KL errors averaged over each threshold.
Figure 5 (RIGHT) shows the results over 50 time-slices. The order of the curves
shows clearly that, as expected, errors grow larger with increasing threshold
values. Although some of the curves are very close, z-tests confirmed the ordering
between the .09, .05 and .01 threshold curves at the 5% significance level, while
threshold=.07 produced significantly larger errors than both .06 and .05, but
the difference between the .06 and .05 curves was not significant.

Fig. 6. Errors obtained by DBN-AW with threshold=1.15, .8, .6, compared to a 4-TS
window on water DBN.



5.2 water DBN: Approximation Error 

Unlike the reduced dbn6 example network, in our experimental environment the 
full water [9] network can only be evaluated exactly by Junction-tree for up to 
four time-slices. So, for the experiments shown here, we use the beliefs from a 
4-TS window of the network as “exact” beliefs against which to compare the 
DBN-AW algorithm, with threshold values of 1.15, .8 and .6. 

Figure 6 shows the performance of DBN-AW with threshold values of 1.15, .8
and .6, by comparing these beliefs with those obtained using a 4-TS window. As
expected for DBN-AW, the errors decrease with decreasing threshold, while the
errors are seen to remain bounded over time. Note that we also obtained results for
a 2-TS window; z-tests showed no significant difference between the 2-TS
window and DBN-AW with threshold=1.15. However, z-tests confirmed that threshold
values of .8 and .6 produced lower errors than 2-TS at the 5% significance level.
The effect of the threshold on the error can also be seen by the average errors
over the 50 time-slices, listed in Table 1.



Algorithm                   Average KL error
2-TS window                 .112
DBN-AW, threshold=1.15      .060
DBN-AW, threshold=.8        .024
DBN-AW, threshold=.6        .003

Table 1. Average error over 50 time-slices for 2-TS window and DBN-AW with
threshold=1.15, .8, .6 compared to a 4-TS window on water DBN.
5.3 Reduction in Evaluation Computation

Now that we have seen how the size of the approximation error depends on the
pruning threshold used by algorithm DBN-AW, let us consider the corresponding
reduction in computational resources. The Join-Tree Cost (JTC) [6] is an
accepted measure for comparing the computational complexity of BN evaluation.
Table 2 contains the results of comparing the Join-Tree Costs of the example
networks with and without using DBN-AW. The final column in this table, "Savings"
(SF), is the fractional reduction in the JTC. It is calculated as:

$$1 - \frac{JTC(\text{approximate DBN})}{JTC(\text{original DBN})}$$

For comparison, the reductions for a fixed 2-TS window are also included
in the table. These results show that the reduction in computation cost using
DBN-AW increases as the threshold is increased, and that the 2-TS window
produces the largest reduction. Clearly this reduction for the 2-TS window is at the
expense of a greater error in the approximate beliefs as seen in the results of the
previous section.
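The savings fraction is simple enough to check directly against the table entries. The function name below is illustrative; the two calls use the JTC values reported for dbn6 (6-TS original, threshold .9) and for water (4-TS original, threshold 1.15).

```python
def savings_fraction(jtc_original, jtc_approximate):
    """Fractional reduction in Join-Tree Cost: 1 - JTC(approx) / JTC(original)."""
    return 1.0 - jtc_approximate / jtc_original

print(round(savings_fraction(336629, 17825), 3))       # dbn6, threshold .9
print(round(savings_fraction(17392941, 2571699), 3))   # water, threshold 1.15
```

Rounded to three decimals these reproduce the tabulated values of .947 and .852 respectively.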



Network   Original DBN   JTC (original)   threshold for DBN-AW   JTC (approx.)   SF
dbn6      6-TS           336629           .9                     17825           .947
                                          .6                     20495           .939
                                          .3                     53144           .842
                                          2-TS                   2663            .992
          50-TS          9364712          .9                     17825           .998
                                          .6                     20495           .998
                                          .3                     53144           .994
                                          2-TS                   2663            .999
water     4-TS           17392941         1.15                   2571699         .852
                                          .8                     5313675         .694
                                          .6                     5313837         .694
                                          2-TS                   12519           .999

Table 2. Computational savings in terms of the Junction-tree cost for DBN-AW
using various threshold values and for 2-TS on dbn6 and water.



In [13], Kjaerulff reports the computational saving achieved by applying his
method of link removal on the water network. A reduction of 97% is achieved
after removing 126 links from the moral graph; this is at a cost of .001 in total
divergence, which represents the total error introduced in the network when links
are deleted. Using the DBN-AW algorithm with threshold=1.15, the computational
saving is about 85%, less than with Kjaerulff's method. However, Kjaerulff's
global error measure does not provide control over which nodes of the network
bear most of the cost, in terms of error in bel'(Q), of simplifying the network
structure. The issue of which nodes bear most of the error is important in DBN
evaluation; for example, in the monitoring task, we would like to minimise error
in nodes of the current time-slice, and other nodes are less important. This is
assured in the DBN-AW algorithm because arcs are deleted in past time-slices, in
a way that least affects the beliefs of nodes of the current time-slice. Because
Kjaerulff does not specify the actual error in specific query nodes, it is difficult to
make a comparative assessment of his .001 figure to the error produced by our
algorithm for water. The results reported by Kjaerulff for water are with a
fixed-sized window of four time-slices. For the purposes of assessing his algorithm, this
is treated as an ordinary network and the dynamic features of the structure are
not exploited. There are no indications of how his algorithm performs when the
DBN is extended by adding more time-slices and of how the error is affected
over time. Using DBN-AW, we can guarantee that evaluation is feasible as more
time-slices are added and our results showed that the error is bounded over time.



6 Conclusions and Future Work 



We have presented a new method for approximate evaluation of dynamic belief 
networks. The algorithm DBN-AW is based on arc weights: at each time-slice, arcs 
are deleted from past slices if their weight is less than a pre-specified threshold 
parameter. Nodes which become disconnected are also deleted. This corresponds 
to a process of gradually forgetting past events, the relationships between them 
and their possible impact on present events. Since arcs of smaller weight are 
deleted first, information most relevant to the current belief state is retained. 
By adjusting the threshold value, one can vary the amount of information that is 
discarded at each step. The question of how to set the threshold value in practice 
is an issue for future work. In general, the best threshold will depend on the level 
of accuracy desired and the range of possible values will vary according to the 
DBN being used. 

Our empirical testing of the algorithm on two example networks showed 
that the error introduced by the approximation stabilises over time and can 
be controlled by the threshold parameter. We also showed that our algorithm 
produces a corresponding reduction in computational complexity, as measured 
by the Join-Tree Cost. These results indicate that DBN-AW provides an efficient
and more flexible alternative to using a fixed window for controlling the com- 
plexity of DBN evaluation. Our method can also be used in conjunction with 
other approximation algorithms; we intend to investigate the performance of 
such combinations of algorithms in the future. 

The arc weight measure used to select arcs for deletion is based on mutual 
information. We are applying it in a range of other approximate evaluation 
methods, such as selecting nodes for state-space abstraction and partial join- 
tree evaluation [10]. 
Abstract. Kearns et al. (1997) in an earlier paper presented an
empirical evaluation of model selection methods on a specialized version of the
segmentation problem. The inference task was the estimation of a
predefined Boolean function on the real interval [0, 1] from a noisy random
sample. Three model selection methods based on the Guaranteed Risk
Minimization, Minimum Description Length (MDL) Principle and Cross
Validation were evaluated on samples with varying noise levels. The
authors concluded that, in general, none of the methods was superior to
the others in terms of predictive accuracy. In this paper we identify an
inefficiency in the MDL approach as implemented by Kearns et al. and
present an extended empirical evaluation by including a revised version
of the MDL method and another approach based on the Minimum
Message Length (MML) principle.



1 Introduction 

The segmentation problem occurs when there is a need to partition some data
into distinct homogeneous regions. The specialized binary sequence problem
framework considered in this paper was introduced by Kearns et al. [1]. An
unknown Boolean function $f(x)$ is defined on the real interval $0 \le x \le 1$. The
interval is partitioned into $(k + 1)$ sub-intervals by $k$ “cut points” $\{c_j : j = 1..k\}$
which are uniformly and randomly distributed in $[0, 1]$ and indexed so that
$c_j < c_{j+1}$, $(j = 1..k - 1)$. The function $f(x)$ is defined to be 0 in even-numbered
sub-intervals, and 1 in odd-numbered sub-intervals, the sub-intervals being
numbered from 0 to $k$ so that cut-point $c_j$ separates sub-intervals $j - 1$ and $j$. Data
is generated from this model at $N$ sample points $\{x_i : i = 1..N\}$. The Boolean
datum $y_i$ generated at $x_i$ differs from $f(x_i)$ with probability $p < (1/2)$. Thus
the probability that $y_i = 1$ alternates between $p$ and $(1 - p)$, depending on
whether $x_i$ lies in an even or odd sub-interval.

The inference task, termed the “intervals model selection” problem by
Kearns et al. [1], is to infer an approximation to the function $f(x)$ from the
data $\{x_i, y_i : i = 1..N\}$. That is, we wish to infer the number and position of
the cutpoints (and, incidentally, the unknown “noise” rate, $p$).

The intervals model selection problem was originally employed by
Kearns et al. [1] in their evaluation of different model selection methods. These
selection methods included the two penalty-based methods, Vapnik’s Guaranteed
Risk Minimization (GRM) [5] and a version (KMDL) [1] of Rissanen’s Minimum
Description Length (MDL) Principle [4], and cross validation (hold out) [8]. The
motivation for this choice of problem was that, while being non-trivial, it
appeared to Kearns et al. to permit exact solution of the optimizations required
by the different methods and hence seemingly offered a comparison of the methods
untainted by questionable mathematical approximations. Kearns et al. [1]
reported that MDL performed no better than cross validation in this task.

Unfortunately, their application (KMDL) of the MDL method was flawed, 
and hence the comparisons presented in [1] are misleading. This paper repli- 
cates the experiments of [1], omitting their implementation of Vapnik’s method 
(which appears correct). Minimum Message Length (MML) model selection [6] 
has been shown to perform significantly better than approaches based on GRM 
in the context of polynomial model selection [6]. We correct an approximation 
in KMDL, obtaining a slightly improved method which we term “CMDL”, al- 
though it is still an improper application (see section 6) of the MDL principle. 
Consistent with [1], we find both KMDL and CMDL to perform relatively poorly
unless the sample is large (with CMDL slightly superior). Finally, we develop 
a more correct MDL method, using the theoretical framework of the Minimum 
Message Length principle [2,3], with which we are more familiar. For this prob- 
lem, there seems little significant difference between MDL (properly applied) and 
MML. The poor behaviour of KMDL is again observed, but the MML method 
works well, and compares well with the cross-validation (CV) method which we 
implement in the same form as in [1]. 

2 Definitions 

This section presents standard definitions for all the terms used in this paper. 

1. $S$ : training set, $(x_i, y_i)$.

2. $N$ : sample size.

3. $p$ : true probability (noise rate).

4. $\hat{p}$ : estimated probability.

5. $k$ : number of cuts.

6. $d$ : number of alternations of label in $S$ ($d = k + 1$).

7. $f(x)$ : true Boolean function from which $S$ is generated.

8. $h(x)$ : learning algorithm’s estimate of $f(x)$ from $S$.

9. $H(x)$ : binary entropy function given by $-(x \log x + (1 - x) \log(1 - x))$.

We also define some standard error measures employed in the paper following
notation used by Kearns et al. [1]:

- $\epsilon(h)$ represents the generalization error of a hypothesis $h(x)$ with respect to
the target function $f(x)$. $\epsilon(h) = KL(f(x) \,\|\, h(x))$, which is the
Kullback-Leibler distance from $f(x)$ to $h(x)$.
- $\hat{\epsilon}(h)$ denotes the training error of $h$ on sample $S$.
$\hat{\epsilon}(h) = |\{(x_i, y_i) \in S : h(x_i) \ne y_i\}| / N$.



3 Kearns’s Intervals Model Selection Problem 

To test the various methods, Kearns et al. [1] chose a function f(x) with 100 
intervals each of length 0.01 (99 equally-spaced cutpoints). This function is the 
easiest to learn among all functions with 99 cuts. A randomly spaced set of 
cuts would increase the chance that some subintervals would contain few (or no) 
sample points, making them much harder to detect. In this study we employ 
generating functions with randomly-placed cuts. Note that none of the learning 
methods assume approximate or exact equality of subinterval lengths: they all 
assume the locations of the cuts to be random. 

A single test problem is generated from $f(x)$ by fixing a sample size $N$ and a
noise probability, $p$. Then, $N$ $x$-values are selected from the uniform distribution
in (0,1), and for each $x_i$, a Boolean datum $y_i$ is generated as $f(x_i)$ XOR ran($p$),
where ran($p$) is a random noise bit with probability $p$ of being 1. Many
replications of a problem with given $N$ and $p$ are generated by making different random
selections of the sample points and noise bits.
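
The generation procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code used in the experiments: the function names, the fixed seed and the use of Python's standard random generator are all hypothetical choices.

```python
import random

def make_target(k, rng):
    """Return k sorted cutpoints drawn uniformly from (0, 1)."""
    return sorted(rng.random() for _ in range(k))

def f(x, cuts):
    """Alternating target: 0 in even-numbered sub-intervals, 1 in odd ones."""
    below = sum(1 for c in cuts if c <= x)  # index of the sub-interval holding x
    return below % 2

def make_sample(cuts, n, p, rng):
    """Draw n sorted points from (0, 1) and label each as f(x) XOR a noise bit."""
    xs = sorted(rng.random() for _ in range(n))
    ys = [f(x, cuts) ^ (1 if rng.random() < p else 0) for x in xs]
    return xs, ys

rng = random.Random(0)          # fixed seed, chosen only for repeatability
cuts = make_target(100, rng)    # 100 randomly placed cutpoints
xs, ys = make_sample(cuts, 1000, 0.1, rng)
```

Each replication of a problem would call `make_sample` again with the same `cuts`, `n` and `p`.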

For a given sample $S$, Kearns et al. [1] see the essence of the learning problem
as being the selection of a model class, where a class $F_k$ is the set of all alternating
functions with $k$ cuts. That is, the essence is the estimation or selection of $k$.
Within a class $F_k$, a simple dynamic programming algorithm suffices to find
the model function $h_k(x)$ with maximum likelihood, i.e. with minimum training
error $\hat{\epsilon}(h_k)$. Of course, the locations of the cutpoints of $h_k(x)$ are determined
only to within the interval between two adjacent sample points. In this work, we
take the cutpoint of $h_k(x)$ which lies between $x_i$ and $x_{i+1}$ to be midway between
the sample points.
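
The dynamic programming step is not spelled out in the text. One possible formulation, given here only as a sketch, is shown below. It assumes the labels are sorted by their $x$-values, that every hypothesis takes the value 0 in sub-interval 0, and that at most one cut may be placed immediately before each sample point; the function name is hypothetical.

```python
def min_errors_per_k(ys, k_max):
    """Return a list best where best[k] is the fewest disagreements between
    ys and any alternating labelling that starts at 0 and has exactly k cuts.
    Entries are float('inf') when k cuts cannot be placed."""
    INF = float("inf")
    # prev[j]: fewest errors on the points seen so far using exactly j cuts
    prev = [0] + [INF] * k_max
    for y in ys:
        cur = [INF] * (k_max + 1)
        for j in range(k_max + 1):
            stay = prev[j]                        # no cut just before this point
            cut = prev[j - 1] if j > 0 else INF   # one cut just before this point
            # after j cuts the hypothesis predicts the label j % 2
            cur[j] = min(stay, cut) + (y != j % 2)
        prev = cur
    return prev

errors = min_errors_per_k([0, 0, 1, 1, 0], 2)   # error counts for k = 0, 1, 2
```

The running time is proportional to the sample size times `k_max`. Recovering the cut positions themselves would need back-pointers, which are omitted here for brevity.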

The learning task thus reduces to selecting a model from the set of model
functions $\{h_k(x); k = 1..k_{max}\}$, where $k_{max}$ is the largest number of cuts
resulting in any reduction in training error. This set of models was then given to
the two model selection methods based on GRM [5] and a version (KMDL [1])
of the MDL [4] principle. For the cross-validation method, the protocol was
slightly different. The sample $S$ was divided randomly into a 90% fitting set and
a 10% validation set. A set of maximum-likelihood models $\{h_k(x); k = 1..k_{max}\}$
was developed from the fitting sample, and the cross-validation method selected
from that set the model with the lowest error on the validation sample. It is
not claimed that this represents an optimal cross-validation, but it was chosen
as a simple and representative application of the method. The generalization
error for the model selected by each method was computed with respect to the
problem target function $f(x)$.

4 The “KMDL” Method



We replicated the MDL encoding scheme employed by Kearns et al. in their pa- 
per. To distinguish from a modified method we consider later, we term it KMDL. 
According to them, MDL [4] is a broad class of algorithms with a common infor- 
mation theoretic motivation and each MDL algorithm is determined by a specific 
encoding scheme for both functions and their training errors. They present one 
such encoding scheme for the binary sequence problem. 

Let $h$ be a function with exactly $k$ cut points. Description of $h(\cdot)$ first
requires specification of its number $k$ of cutpoints. The length of this description
is neglected in KMDL. Given $k$, we can sufficiently describe $h(\cdot)$ by specifying
the $k$ sample points immediately before which $h(\cdot)$ changes value. Note that it
makes sense for $h(\cdot)$ to have a cutpoint before the first sample point $x_1$, but not
after the last, $x_N$. Thus, there are $N$ places where $h(\cdot)$ may have cuts.
Assuming, as in [1], that the cuts are equally likely to occur in any of these places,
specifying their locations takes $\log_2 \binom{N}{k}$ bits. Given $h(\cdot)$, the training samples
can simply be encoded by correcting the mistakes implied by $h(\cdot)$. Suppose $h(x)$
differs from $y(x)$ at $m$ sample points, where $m = N \times \hat{\epsilon}(h)$. (KMDL neglects the
cost of describing $m$.) Given $m$, the identity of the $m$ sample points where $y_i$
differs from $h(x_i)$ can be specified with $\log_2 \binom{N}{m}$ bits. Thus, KMDL arrives at a
total description length of

$$\log_2 \binom{N}{k} + \log_2 \binom{N}{m} \ \text{bits}$$

In [1], expressions such as $\log_2 \binom{N}{m}$ are approximated by $N \times H(m/N)$ where
$H(\cdot)$ is the binary entropy function. Dividing by $N$ leads to the KMDL choice
of $k$:

$$\hat{k} = \arg\min_k \{ H(k/N) + H(\hat{\epsilon}(h_k)) \} \qquad (1)$$



5 The “CMDL” Method 

We attempt a modest correction to the KMDL method, which we term CMDL. 
An important point that we note in developing our MDL approach is that as we
are only considering the maximum-likelihood model in each class there cannot
be any misses adjacent to a cutpoint. Therefore, we can safely assume that the
number of cuts $k < N - 2m$. The encoding scheme includes the lengths needed for
the description of $k$ and $m$. The training error count $m$ can certainly not exceed
$N/2$, so the cost of encoding $m$ is $\log(N/2)$ bits. The cost of specifying a cutpoint
from $N - 2m$ potential cuts is $\log(N - 2m)$ bits. We also replace the binary
entropy approximation in KMDL by the accurate log-combinations expression.
Thus, CMDL selects

$$\hat{k} = \arg\min_k \left\{ \log(N - 2m) + \log(N/2) + \log \binom{N - 2m}{k} + \log \binom{N}{m} \right\} \qquad (2)$$
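
To make the two selection rules concrete, the sketch below scores a list of candidate models with the KMDL criterion (1) and with CMDL as we read criterion (2), that is, with a $\log \binom{N}{m}$ term for the data part. The function names are hypothetical, logarithms are taken to base 2 throughout, and `errors[k]` is assumed to hold the training error count $m$ of the best $k$-cut model.

```python
from math import comb, inf, log2

def binary_entropy(x):
    """H(x) in bits, taking H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -(x * log2(x) + (1 - x) * log2(1 - x))

def kmdl_score(n, k, m):
    """Per-sample KMDL length H(k/N) + H(m/N), as in criterion (1)."""
    return binary_entropy(k / n) + binary_entropy(m / n)

def cmdl_score(n, k, m):
    """CMDL length in bits, under our reading of criterion (2)."""
    places = n - 2 * m                 # places where a cut may fall
    if places <= 0 or k > places:      # no valid encoding in this sketch
        return inf
    return (log2(places) + log2(n / 2)
            + log2(comb(places, k)) + log2(comb(n, m)))

def select_k(n, errors, score):
    """Pick the k whose (k, errors[k]) pair has the smallest score."""
    return min(range(len(errors)), key=lambda k: score(n, k, errors[k]))

# Illustrative error counts for k = 0..4 on a sample of size 100 (made up).
errors = [40, 25, 12, 11, 11]
k_kmdl = select_k(100, errors, kmdl_score)
k_cmdl = select_k(100, errors, cmdl_score)
```

The error counts above are invented purely to exercise the functions and are not taken from the experiments.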
6 The Flaw in KMDL

While the minor correction of KMDL to CMDL has some beneficial effect, there 
remains a serious flaw in the coding scheme used in both methods. The use of a 
Minimum Description Length principle to select among competing model-based 
encodings of the data can make sense only if the coding scheme used with each 
competing model indeed minimizes the length of the description employing that 
model. The scheme used in KMDL and CMDL does not come close. 

These methods specify the location of the cutpoints of a model to within
the interval between adjacent sample points. On average, this is a precision
of about $(1/N)$. Except for very low noise rates, such precise specification is
unwarranted, leads to an over-long description, and vitiates the comparison of
competing models.

It is an essential feature of efficient model-based coding (whether MDL or
MML) that no estimated parameter be specified more accurately than it can be
estimated. Suppose, in this problem, that we decide to encode the cutpoints
so that they are always required to precede an odd-numbered sample point.
That is, we encode them to an average precision of $(2/N)$. What effect will this
have on the description length? First, we save approximately $k$ bits, because
for each of the $k$ cutpoints there are now only $N/2$ possible locations to be
selected among. Second, for each cut point, there is a probability $(1/2)$ that
it is where we wanted it to be. If it is not, then one $y$-value will be encoded
using probability $p$ instead of $(1 - p)$ or vice-versa. The final result is that the
description length is increased on average by $(k/2)(\log_2((1 - p)/p) - 2)$ bits.
This quantity is negative for $p > 0.2$, so by lowering the precision of cutpoint
specification we actually shorten the description unless the noise rate is less than
0.2. The MML method now described generalizes this approach. (For a detailed
recent comparison between MML and MDL, see Wallace and Dowe [7].)
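
The break-even noise rate can be checked numerically. The sketch below takes the argument above at face value: each of the $k$ cutpoints saves one bit, and with probability 1/2 it costs an extra $\log_2((1 - p)/p)$ bits for one mis-coded $y$-value. The function name is hypothetical.

```python
from math import log2

def net_change_bits(k, p):
    """Expected change in description length (bits) when cutpoints are
    restricted to precede odd-numbered sample points: a saving of 1 bit
    per cut, against a penalty of log2((1-p)/p) with probability 1/2."""
    return k * (0.5 * log2((1 - p) / p) - 1.0)

for p in (0.1, 0.2, 0.3):
    print(p, net_change_bits(100, p))
```

The change is positive at $p = 0.1$, zero at $p = 0.2$ and negative at $p = 0.3$, in line with the threshold of 0.2 quoted in the text.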

7 MML Based Model Selection 

In essence, MML [2,3,7] seeks to minimize a message length defined by the joint
encoding of the model and data given the model. It will be more convenient now
to measure lengths in nits rather than bits (1 nit = $\log_2 e$ bits) so now we switch
to natural logs.

We start the message by stating the estimated $k$, with length $\log(N/2)$.
(Henceforth, we use $k$ and $p$ to denote the estimated model quantities.) Next,
the message states $p$, the estimated noise rate. This value determines that
$y$-values agreeing with $h(x)$ will be encoded with length $-\log(1 - p)$ each, and
those disagreeing, with length $-\log p$.

The MML principle [3,7] offers the following general expression for computing
the MML message length for parameter vector $\theta$ and data $x$:

$$\mathrm{MessLen} = -\log g(\theta) - \log f(x|\theta) + 0.5 \log F(\theta) - \frac{D}{2} \log 12 + \frac{D}{2} \qquad (3)$$

where $g(\theta)$ is the prior density on $\theta$, $f(x|\theta)$ is the likelihood of data $x$ given
$\theta$, $D$ is the number of parameters and $F(\theta)$ is the “Fisher Information”. In this
application of the MML principle, $\theta$ is the parameter $p$ of a binomial sequence
with $N$ trials and $m$ successes, so the relevant likelihood function is

$$f(m|p) = p^m (1 - p)^{N - m} \qquad (4)$$

Since we assume $0 < p < 0.5$ but have no other prior information, we assume a
uniform prior $g(p)$.

The Fisher Information $F(p)$ is easily shown [3] to be

$$F(p) = \frac{N}{p(1 - p)} \qquad (5)$$

The value of $p$ that minimizes the message length can be derived [3,2] by
differentiating the expression for MessLen to obtain

$$p = (m + 0.5)/(N + 1.0) \qquad (6)$$
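
As a numerical check on estimate (6), the sketch below evaluates expression (3) for the single parameter $p$ (so $D = 1$) and confirms by grid search that the closed form sits at the minimum. The function names are hypothetical, and the prior density of 2 is our assumption for a uniform prior on (0, 0.5); being constant, it does not affect the location of the minimum.

```python
from math import log

def mml_estimate(n, m):
    """Estimate (6): p = (m + 0.5) / (N + 1)."""
    return (m + 0.5) / (n + 1.0)

def p_message_length(n, m, p, prior_density=2.0):
    """Expression (3) in nits for one parameter, using likelihood (4)
    and Fisher information (5). prior_density=2.0 is an assumption."""
    neg_log_prior = -log(prior_density)
    neg_log_lik = -(m * log(p) + (n - m) * log(1 - p))
    fisher = n / (p * (1 - p))
    return neg_log_prior + neg_log_lik + 0.5 * log(fisher) - 0.5 * log(12) + 0.5

n, m = 200, 30
grid = [i / 2000 for i in range(1, 1000)]          # p from 0.0005 to 0.4995
best = min(grid, key=lambda p: p_message_length(n, m, p))
print(best, mml_estimate(n, m))
```

The grid minimum and the closed-form estimate agree to within the grid spacing.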



7.1 Encoding a Cut-Point 

As the cut-points are real-valued, it would require an infinite number of bits
to specify them precisely. Thus, we need to find an optimal precision for our
parameter specification. Let $\delta$ be the precision (range) with which we want to
specify our cut-point. We assume that the true cut-point lies within this range.
Let $\epsilon$ be the difference between the true cutpoint and the estimated one. Since
we assume that our cut-point is uniformly distributed in $\delta$, the difference $\epsilon$ is
uniform in $[-\delta/2, +\delta/2]$.

The expected number of data points in a region of width $\delta$ is given by $N\delta$.
Since our training values are uniformly distributed, the expected number of data
points coded with the wrong probability (put in the “wrong” side of the cutpoint)
is given by $N|\epsilon|$. The expected value of $|\epsilon|$ can be derived as,

$$E(|\epsilon|) = \frac{2}{\delta} \int_0^{\delta/2} \epsilon \, d\epsilon = \frac{2}{\delta} \cdot \frac{\delta^2}{8} = \frac{\delta}{4} \qquad (7)$$

The expected excess cost (in message length) of encoding a single data item
with the wrong probability can be calculated by computing the difference
between the expected cost of encoding the data item with the correct and incorrect
probabilities. Thus, the expected loss per wrong item can be derived as

$$\begin{aligned}
\mathrm{ExpLoss} &= -p \log(1 - p) - (1 - p) \log p - (-p \log p - (1 - p) \log(1 - p)) \\
&= p(\log p - \log(1 - p)) + (1 - p)(\log(1 - p) - \log p) \\
&= p \log \frac{p}{1 - p} + (1 - p) \log \frac{1 - p}{p} \\
&= (2p - 1) \log \frac{p}{1 - p} \\
&= (2(1 - p) - 1) \log \frac{1 - p}{p}
\end{aligned} \qquad (8)$$

which is symmetrical between $p$ and $(1 - p)$.

The expected increase in the cost of encoding the $N\delta/4$ $y$-values expected to
be encoded with the “wrong” probability is

$$\mathrm{ExcessCost} = \frac{N\delta}{4} (2p - 1) \log \frac{p}{1 - p} \qquad (9)$$

The cutpoints of $h(\cdot)$ are parameters of the model and so must be encoded.
To encode the position of a cut to precision $\delta$ within the range (0,1) requires
length $-\log \delta$. Hence, for each cut point, the total cost incurred by encoding its
position to precision $\delta$ is

$$\mathrm{TotalExcessPerCutpoint} = -\log \delta + \frac{N\delta}{4} (2p - 1) \log \frac{p}{1 - p} \qquad (10)$$

Differentiating TotalExcessPerCutpoint in (10), we find the value of $\delta$ that
yields the minimum message length to be,

$$\delta = \frac{4}{N (2p - 1) \log \frac{p}{1 - p}} \qquad (11)$$

With this choice of $\delta$, from (9) and (11), the expected ExcessCost is just
one nit per cutpoint. The total cost of encoding all $k$ cutpoints to precision $\delta$ is

$$\log(N - 2m) - k \log(\delta/N) - \log(k!)$$

since the order in which they are specified is immaterial and $k < (N - 2m)$. As
mentioned earlier, the use of an overly precise specification for $\delta$ is only justified
when the noise rate $p$ is very low. Substituting $1/N$ (as used in
Kearns et al. [1]) for $\delta$ in expression (11) above we find that $p \approx 0.018$. In
other words, the precision specified by Kearns et al. is only justified for quite
extreme values, namely if $p < 1 - 0.982 = 0.018$.
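
The precision calculation can be reproduced with a few lines of Python. This is a sketch under the expressions as numbered above; the function names are hypothetical, natural logarithms are used (lengths in nits), and $p$ must lie strictly between 0 and 0.5 so that the loss per item is non-zero.

```python
from math import log

def loss_per_item(p):
    """Expression (8): expected extra nits for one item coded with the
    wrong probability. Positive for 0 < p < 0.5."""
    return (2 * p - 1) * log(p / (1 - p))

def optimal_precision(n, p):
    """Expression (11): the delta that minimises expression (10)."""
    return 4.0 / (n * loss_per_item(p))

def excess_cost(n, p, delta):
    """Expression (9): expected extra nits caused by imprecise cutpoints."""
    return n * delta / 4.0 * loss_per_item(p)

n = 1000
d = optimal_precision(n, 0.2)
print(n * d)                      # precision in units of the 1/N spacing
print(excess_cost(n, 0.2, d))     # one nit per cutpoint at the optimum
print(n * optimal_precision(n, 0.018))
```

At $p = 0.2$ the optimal precision is roughly $4.8/N$, several times coarser than the $1/N$ spacing, while at $p = 0.018$ the product $N\delta$ is close to 1.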



7.2 The Total Message

The length of the entire MML message can now be computed. The components
are the statement of $k$, the statement of $p$ to precision $\sqrt{12/F(p)}$ within the
range (0,1/2), the positions of the cutpoints to precision $\delta$, and finally the
expected DataLoss of one nit per cutpoint resulting from imprecise specification
of cutpoints. Given $m$, the identity of the $m$ sample points where $y_i$ differs from
$h(x_i)$ can be specified with code length $\log \binom{N}{m}$, which is included in the data
part of our message.

In estimating the noise parameter $p$, the number of mistakes $m$ made by
the maximum-likelihood model $h_k(x)$ is increased by the expected additional
number of disagreements resulting from the imprecision of cutpoints. The
resulting estimated error count is used in place of $m$ in estimating $p$, which affects
the choice of $\delta$, and hence the estimated error count. A few iterations of these
calculations converge quickly.

As a result, our estimate of the noise rate exceeds $m/N$. The effect seems
to correct for the overfitting of the maximum-likelihood model which, in KMDL
and CMDL, leads to an underestimate of $p$.

8 Results 

All four methods (MML, KMDL, CMDL and CV) were compared on noise
rates of 0, 0.1, 0.2, 0.3 and 0.4 with sample sizes ranging from 100 to 3000 in
steps of 100. For each choice of $p$ and $N$, 100 replications were performed. The
true cutpoint model consisted of 100 either evenly-spaced or randomly generated
cutpoints, but here, due to space restrictions, we only include results from
evaluation of all our methods on random cutpoint models. All methods were
compared on the basis of the number of estimated cuts and the Kullback-Leibler
(KL) distance between the true and estimated model. The Kullback-Leibler
distance (also known as the “relative entropy”) measures the expected excess cost
of using an encoding based on the estimated model rather than the true model.
Formally, the Kullback-Leibler distance between the true distribution $p(y)$ and
the estimated distribution $q(y)$ is:

$$KL(p \| q) = \sum_y p(y) \log \frac{p(y)}{q(y)} \qquad (12)$$
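
For discrete distributions, expression (12) is straightforward to compute. The sketch below is a generic illustration and not the evaluation code used for the figures; the function names are hypothetical and lengths are in nits.

```python
from math import log

def kl_distance(p, q):
    """KL(p || q) in nits for two discrete distributions given as
    equal-length sequences of probabilities. Terms with p(y) = 0 add 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bernoulli(a, b):
    """KL distance between Bernoulli(a) and Bernoulli(b)."""
    return kl_distance([a, 1 - a], [b, 1 - b])

print(kl_bernoulli(0.1, 0.2))   # cost of coding noise rate 0.1 as if it were 0.2
print(kl_bernoulli(0.2, 0.1))   # the measure is not symmetric
```

The asymmetry is why the text speaks of the distance from the true model to the estimated one.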

Finally, although as above, all methods were evaluated with both fixed and
randomly spaced cutpoints, it is important to remember that the original statement
of the problem [1] and all the models implicitly assume that the cutpoints
are randomly distributed. Figures 1-3 include comparisons of the KL distance
and the number of estimated cuts for noise rates of 0.1, 0.2 and 0.3 respectively.

Each figure plots the generalization error as measured by the KL distance 
and the estimated number of cuts collated. All plots represent averages over 100 
replications. With no noise (i.e., p = 0), all methods understandably performed 
well. But more important is how these methods perform given noise and ran- 
dom cut points. The robust performance of MML for small sample sizes in the 
presence of increasing noise values can be clearly observed from Figures 1-3. It 
is also important to observe that although MML tends to be conservative in 
estimating large numbers of cuts in comparison with CV, it does much better 
at minimizing the KL distance. Thus, in general, the MML approach performs 
significantly better than all the other methods evaluated when the models (cuts) 
are randomly distributed (as assumed in the problem framework). 




[Figure panels: Estimated Cuts; Kullback-Leibler Distance. Noise = 0.1, Cuts = 100. Each point represents an average of 100 trials.]

Fig. 1. Evaluation of Different Methods with Random Cutpoints

[Figure panels: Estimated Cuts; Kullback-Leibler Distance.]

[Figure panels: Estimated Cuts; Kullback-Leibler Distance. Noise = 0.3, Cuts = 100. Each point represents an average of 100 trials.]

9 Discussion

The results shown here demonstrate that the poor behaviour of the “MDL” 
method reported by Kearns et al. (KMDL) [1] was not inherent in the MDL 
principle. Rather, it was caused by their failure properly to consider the mini- 
mization of the description lengths for each model, and in particular the need 
to encode estimates (here the cut positions) to an appropriate precision. 

The MML method developed here (which may equally well be considered an
MDL method) is itself only a rough application of the principle, and still uses a
sub-optimal coding scheme. In particular, it uses a constant cutpoint precision $\delta$
for all cutpoints of the model, whereas our derivation of $\delta$ clearly implies that the
precision used for each cutpoint should reflect the local density of sample points
near the cut. In future work, we hope to develop a more carefully optimized MML
method making proper use of knowledge of the sample point locations. Previous
experience with MML leads us to expect that such a development would lessen
the over-caution of the present method in finding cuts in small samples.

Abstract. Q-learning can be used to learn a control policy that maximises
a scalar reward through interaction with the environment. Q-learning
is commonly applied to problems with discrete states and actions.
We describe a method suitable for control tasks which require continuous
actions, in response to continuous states. The system consists of
a neural network coupled with a novel interpolator. Simulation results
are presented for a non-holonomic control task. Advantage Learning, a
variation of Q-learning, is shown to enhance learning speed and reliability
for this task.



1 Introduction 

Reinforcement learning systems learn by trial-and-error which actions are most 
valuable in which situations (states) [14]. Feedback is provided in the form of 
a scalar reward signal which may be delayed. The reward signal is defined in 
relation to the task to be achieved; reward is given when the system is successfully 
achieving the task. The value is updated incrementally with experience and is 
defined as a discounted sum of expected future reward. The learning system’s
choice of actions in response to states is called its policy. Reinforcement learning
lies between the extremes of supervised learning, where the policy is taught by 
an expert, and unsupervised learning, where no feedback is given and the task 
is to find structure in data. 

There are two prevalent approaches to reinforcement learning: Q-learning and 
actor-critic learning. In Q-learning [16] the expected value of each action in each 
state is stored. In Q-learning the policy is formed by executing the action with 
the highest expected value. In actor-critic learning [4] a critic learns the value 
of each state. The value is the expected reward over time from the environment 
under the current policy. The actor tries to maximise a local reward signal from 
the critic by choosing actions close to its current policy then changing its policy 
depending upon feedback from the critic. In turn, the critic adjusts the value of 
states in response to rewards received following the actor’s policy. 

The main advantage of Q-learning over actor-critic learning is exploration
insensitivity: the ability to learn without necessarily following the current policy.
However, actor-critic learning has a major advantage over current implementations
of Q-learning: the ability to respond to smoothly varying states with
smoothly varying actions. Actor-critic systems can form a continuous mapping
from state to action and update this policy based on the local reward signal from
the critic. Q-learning is generally considered in the case that states and actions
are both discrete. In some real world situations, and especially in control, it is
advantageous to treat both states and actions as continuous variables.

This paper describes a continuous state and action Q-learning method and 
applies it to a simulated control task. Essential characteristics of a continuous 
state and action Q-learning system are also described. Advantage Learning [7] 
is found to be an important variation of Q-learning for these tasks. 



2 Q-Learning 

Q-learning works by incrementally updating the expected values of actions in 
states. For every possible state, every possible action is assigned a value which is a 
function of both the immediate reward for taking that action and the expected 
reward in the future based on the new state that is the result of taking that 
action. This is expressed by the one-step Q-update equation, 

$$Q(x, u) := (1 - \alpha)\, Q(x, u) + \alpha \left( R + \gamma \max_{u_{t+1}} Q(x_{t+1}, u_{t+1}) \right) , \qquad (1)$$

where $Q$ is the expected value of performing action $u$ in state $x$; $x$ is the state
vector; $u$ is the action vector; $R$ is the reward; $\alpha$ is a learning rate which controls
convergence and $\gamma$ is the discount factor. The discount factor makes rewards
earned earlier more valuable than those received later.
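
For discrete states and actions, the one-step update (1) can be written directly against a lookup table. The sketch below is a generic tabular illustration, not the continuous method developed in this paper; the function name, the state and action labels and the parameter values are all hypothetical.

```python
from collections import defaultdict

def q_update(q, x, u, reward, x_next, actions, alpha=0.1, gamma=0.9):
    """Apply the one-step Q-update to the table entry for (x, u).
    q maps (state, action) pairs to values; unseen pairs default to 0."""
    best_next = max(q[(x_next, a)] for a in actions)   # max over next actions
    q[(x, u)] = (1 - alpha) * q[(x, u)] + alpha * (reward + gamma * best_next)
    return q[(x, u)]

q = defaultdict(float)
actions = ["left", "right"]
q_update(q, x="s0", u="left", reward=1.0, x_next="s1", actions=actions)
```

Because the update uses the maximum over next actions rather than the action actually taken next, the experience tuple may come from any behaviour.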

This method learns the values of all actions, rather than just finding the
optimal policy. This knowledge is expensive in terms of the amount of information
which has to be stored, but it does bring benefits. Q-learning is exploration
insensitive: any action can be carried out at any time and information is gained
from this experience. Actor-critic learning does not have this ability; actions
must follow or nearly follow the current policy. This exploration insensitivity
allows Q-learning to learn from other controllers; even if they are directed toward
achieving a different task they can provide valuable data. Knowledge from
several Q-learners can be combined: as the values of non-optimal actions are
known, a compromise action can be found.

In the standard Q-learning implementation Q-values are stored in a table. 
One cell is required per combination of state and action. This implementation 
is not amenable to continuous state and action problems. 

3 Continuous States and Actions

Many real world control problems require actions of a continuous nature, in 
response to continuous state measurements. It should be possible that actions 
vary smoothly in response to smooth changes in state. 

But most learning systems, indeed most classical AI techniques, are designed
to operate in discrete domains, manipulating symbols rather than real-numbered
variables. Some problems that we may wish to address, such as high-performance
control of mobile robots, cannot be adequately carried out with coarsely coded
inputs and outputs. Motor commands need to vary smoothly and accurately in
response to continuous changes in state.

Q-learning with discretised states and actions scales poorly. As the number of
state and action variables increases, the size of the table used to store Q-values
grows exponentially. Accurate control requires that variables be quantised finely,
but as these systems fail to generalise between similar states and actions, they
require large quantities of training data. If the learning task described in Sect. 7
were attempted with a discrete Q-learning algorithm the number of Q-values
to be stored in the table would be extremely large. For example, discretised
roughly to seven levels, the eight state variables and two action variables would
require almost 300 million elements. Without generalisation, producing this number
of experiences is impractical. Using a coarser representation of states leads
to aliasing: functionally different situations map to the same state and are thus
indistinguishable.
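
The table size quoted above follows from simple counting, assuming every one of the ten variables is quantised to exactly seven levels:

```python
levels = 7                        # quantisation levels per variable
state_vars, action_vars = 8, 2    # variables in the task of Sect. 7

# one table cell per combination of state and action values
table_cells = levels ** (state_vars + action_vars)
print(table_cells)                # 282475249
```

That is a little over 282 million cells, and each extra variable at the same resolution multiplies the count by seven.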

It is possible to avoid these discretisation problems entirely by using learning 
methods which can deal directly with continuous states and actions. 

4 Continuous State and Action Q-Learning 

There have been several recent attempts at extending the Q-learning framework 
to continuous state and action spaces [17, 12, 11, 15, 6]. 

We believe that there are eight criteria that are necessary and sufficient for
a system to be capable of this type of learning. Listed in Fig. 1, these requirements
are a combination of those required for basic Q-learning as described in
Sect. 2 with the type of continuous behaviour described in Sect. 3.
None of the Q-learning systems discussed below appears to fulfil all of these criteria
completely. In particular, many systems cannot learn a policy where actions
vary smoothly with smooth changes in state (the Continuity criterion). In these
not-quite-continuous systems a small change in state cannot cause a small change
in action. In effect the function which maps state to action is a staircase: a
piecewise constant function.

Sections 4.1-4.6 describe various real valued state and action Q-learning 
methods and techniques and rate them (in an unfair and biased manner) against 
the criteria in Fig. 1. 

4.1 Adaptive Critic Methods 

Werbos’s adaptive critic family of methods [17] uses several feedforward artificial
neural networks to implement reinforcement learning. The adaptive critic family
includes methods closely related to actor-critic and Q-learning. A learnt dynamic
model assists in assigning reward to components of the action vector (not meeting
the Model-Free criterion). If the dynamic model is already known, or learning one
is easier than learning the controller itself, model based adaptive critic methods
are an efficient approach to continuous state, continuous action reinforcement
learning.

Action Selection:       Finds action with the highest expected value quickly.
State Evaluation:       Finds value of a state quickly as required for the
                        Q-update equation (1). A state’s value is the value of
                        highest valued action in that state.
Q Evaluation:           Stores or approximates the entire Q-function as required
                        for the Q-update equation (1).
Model-Free:             Requires no model of system dynamics to be known or
                        learnt.
Flexible Policy:        Allows representation of a broad range of policies to allow
                        freedom in developing a novel controller.
Continuity:             Actions can vary smoothly with smooth changes in state.
State Generalisation:   Generalises between similar states, reducing the amount
                        of exploration required in state space.
Action Generalisation:  Generalises between similar actions, reducing the amount
                        of exploration required in action space.

Fig. 1. Essential capabilities for a continuous state and action Q-learning system



4.2 CMAC Based Q-Learning

Santamaria, Ashwin and Sutton [12] have presented results for Q-learning systems
using Albus’s CMAC (Cerebellar Model Articulation Controller) [1]. The
CMAC is a function approximation system which features spatial locality, avoiding
the unlearning problem described in Sect. 6. It is a compromise between a
look up table and a weight-based approximator. It can generalise between similar
states, but it involves discretisation, making it impossible to completely fulfil
the Continuity criterion. In [12] the inputs to the CMAC are the state and action,
and the output is the expected value. To find $Q_{max}$ this implementation requires a
search across all possible actions, calculating the Q-value for each to find the
highest. This does not fulfil the Action Selection criterion.

Another concern is that approximation resources are used evenly across the
state and action spaces. Santamaria et al. address this by pre-distorting the
state information using a priori knowledge so that more important parts of the
state space receive more approximation resources.



4.3 S-AHC 

Rummery presents a method which combines Q-learning with actor-critic learning
[11]. Q-learning is used to choose between a set of actor-critic learners. Its
performance overall was unsatisfactory. In general it either set the actions to 
constant settings, making it equivalent to Lin’s system for generalising between 
states [10], or only used one of the actor-critic modules, making it equivalent 
to a standard actor-critic system. These problems may stem from not fulfilling
the Q Evaluation, Action Generalisation and State Generalisation criteria when
different actor-critic learners are used. This system is one of the few which can
represent non-piecewise constant policies (Continuity criteria).



4.4 Q-Kohonen 

Touzet describes a Q-learning system based on Kohonen’s self organising 
map [15, 8]. The state, action and expected value are the elements of the feature 
vector. Actions are chosen by choosing the node which most closely matches the 
state and the maximum possible value (one). Unfortunately the actions are
always piecewise constant, not fulfilling the Continuity criteria. 



4.5 Q-Radial Basis 

Santos describes a system based on radial basis functions [13]. It is very similar 
to the Q-Kohonen system in that each radial basis neuron holds a center vector
like the Kohonen feature vector. The number of possible actions is equal to the 
number of radial basis neurons, so actions are piecewise constant (not fulfilling 
the Continuity criteria). It does not meet the Q Evaluation criteria as only those 
actions described by the radial basis neurons have an associated value. 



4.6 Neural Field Q-Learning 

Gross, Stephan and Krabbes have implemented a Q-learning system based on 
dynamic neural fields [6]. A neural vector quantiser (Neural Gas) clusters similar
states. A neural field encodes the values of actions so that selecting the action 
with the highest Q requires iterative evaluation of the neural field dynamics. 
This limits the speed with which actions can be selected (the Action Selection 
criteria) and values of states found (the State Evaluation criteria). The system 
fulfils the State Generalisation and Action Generalisation criteria. 



4.7 Our Approach 

We seek a method of learning the control for a continuously acting agent
functioning in the real world, for example a mobile robot travelling to a goal
location. For this application of reinforcement learning, the existing approaches
have shortcomings that make them inappropriate for controlling this type of 
system. Many can’t adequately generalise between states and/or actions. Others 
can’t produce smoothly varying control actions or can’t generate actions quickly 
enough for operation in real time. For these reasons we propose a scheme for 
reinforcement learning that uses a neural network and an interpolator to ap- 
proximate the Q-function. 

5 Wire-Fitted Neural Network Q-Learning

Wire-fitted Neural Network Q-Learning is a continuous state, continuous action 
Q-learning method. It couples a single feedforward artificial neural network with 
an interpolator (“wire-fitter”) to fulfil all the criteria in Fig. 1.

Feedforward Artificial Neural networks have been used successfully to gener- 
alise between similar states in Q-learning systems where actions are discrete [10, 
11]. If the output from the neural network describes (non-fixed) actions and their 
expected values, an interpolator can be used to generalise between them. This 
would fulfil the State Generalisation and Action Generalisation criteria. 

Baird and Klopf [2] describe a suitable interpolation scheme called
“wire-fitting”. The wire-fitting function is a moving least squares interpolator,
closely related to Shepard’s function [9]. Each “wire” is a combination of an action
vector, u, and its expected value, q, which is a sample of the Q-function. Baird and
Klopf used the wire-fitting function in a memory based reinforcement learning 
scheme. In our system these parameters describing wire positions are the output 
of a neural network, whose input is the state vector, x. 

Figure 2 is an example of wire-fitting. The action in this case is one
dimensional, but the system supports many dimensional actions. The example shows
the graph of action versus value (Q) for a particular state. The number of wires 
is fixed, the position of the wires changes to fit new data. Required changes 
are calculated using the partial derivatives of the wire-fitting function. Once 
new wire positions have been calculated the neural network is trained to output 
these new positions. 





Fig. 2. The wire-fitting process. The action (u) is one dimensional in this case. Three
wires (shown as o) are the output from the neural network for a particular state.
The wire-fitting function interpolates between the wires to calculate Q for every u.
The new data (*) does not fit the curve well (left), so the wires are moved according
to partial derivatives (right). In other states the wires would be in different positions.



The wire-fitting function has several properties which make it a useful inter- 
polator for implementing Q-learning. 

Updates to the Q-value (1) require Qmax(x, u), which can be calculated
quickly with the wire-fitting function (the State Evaluation criteria).

The action u for Qmax(x, u) can also be calculated quickly (the Action
Selection criteria). This is needed when choosing an action to carry out. A property
of this interpolator is that the highest interpolated value always coincides with
the highest valued interpolation point, so the action with the highest value is
always one of the input actions. When choosing an action it is sufficient
to propagate the state through the neural network, then compare the output q
values to find the best action. The wire-fitter is not required at this stage; the
only calculation is the forward pass through the neural network.
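
The property that the interpolated maximum coincides with the highest valued wire can be sketched from the form of the wire-fitting function (2) given later in this section. The weight symbol $w_i$ and the index $m$ of the highest valued wire are introduced here only for illustration, and the argument assumes $c > 0$ and a small fixed $\epsilon > 0$:

```latex
\[
Q(x,u) = \frac{\sum_{i} w_i\, q_i(x)}{\sum_{i} w_i},
\qquad
w_i = \frac{1}{\|u - u_i(x)\|^2 + c\,(q_{\max}(x) - q_i(x)) + \epsilon} > 0 .
\]
% Q is a weighted average of the q_i with positive weights, so it can never
% exceed the largest of them:
\[
Q(x,u) \le \max_i q_i(x) = q_{\max}(x) \quad \text{for every } u .
\]
% At u = u_m(x), where q_m(x) = q_max(x), the distance term of wire m reduces
% to epsilon, so w_m grows without bound as epsilon -> 0+ while the other
% weights stay bounded:
\[
\lim_{\epsilon \to 0^+} Q(x, u_m(x)) = q_m(x) = q_{\max}(x) .
\]
```

The bound and the limit together indicate that the maximum over u is attained at the action of the highest valued wire, which is why a comparison of the network's q outputs is enough to select an action.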

Wire-fitting also works with many dimensional scattered data while remain- 
ing computationally tractable; no inversion of matrices is required. Interpolation 
is local, only points nearby influence the value of Q. Areas far from all wires have 
a value which is the average of q, wild extrapolations do not occur (see Fig. 2). 
It does not suffer from oscillations, unlike most polynomial schemes. 

Importantly, partial derivatives in terms of each q and u of each point can 
be calculated quickly. These partial derivatives allow error in the output of the 
Q-function to be propagated to the neural network according to the chain rule. 

This combination of neural network and interpolator stores the entire Q func- 
tion (the Q Evaluation criteria). It represents policies in a very flexible way; it 
allows sudden changes in action in response to a change in state by changing 
wires, while also allowing actions to change smoothly in response to changes in 
state (the Continuity and Flexible Policy criteria). 

The training algorithm is shown in Fig. 3.

Training of the single hidden layer, feedforward neural network is by incre- 
mental backpropagation. The learning rate is kept constant throughout. Tan- 
sigmoidal neurons are used, restricting the magnitude of actions and values to 
between 1 and -1. 

The wire-fitting function is 



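$$
Q(x, u)
= \lim_{\epsilon \to 0^+}
\frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(x)}{\|u - u_i(x)\|^2 + c\,(q_{\max}(x) - q_i(x)) + \epsilon}}
     {\displaystyle\sum_{i=0}^{n} \frac{1}{\|u - u_i(x)\|^2 + c\,(q_{\max}(x) - q_i(x)) + \epsilon}}
= \lim_{\epsilon \to 0^+}
\frac{\displaystyle\sum_{i=0}^{n} \frac{q_i(x)}{\mathrm{distance}_i(x, u)}}
     {\displaystyle\sum_{i=0}^{n} \frac{1}{\mathrm{distance}_i(x, u)}}
= \lim_{\epsilon \to 0^+} \frac{\mathrm{wsum}(x, u)}{\mathrm{norm}(x, u)}
\qquad (2)
$$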



where i is the wire number; n is the total number of wires; x is the state vector;
$u_i(x)$ is the ith action vector; $q_i(x)$ is the value of the ith action vector; u
is the action vector to be evaluated; c is a small smoothing factor; and $\epsilon$
avoids division by zero. The dimensionality of the action vectors u and $u_i$ is the
number of continuous variables in the action. The two simplified forms shown simplify
description of the partial derivatives. The partial derivative of Q from (2) in

1. In real time, feed the state into the neural network. Carry out the action
   with the highest q. Store the resulting change in state.

2. Calculate a new estimate of Q from the current value, the reward and the
   value of the next state. This can be done when convenient.

3. From the new value of Q calculate new values for u and q using the wire-fitter
   partial derivatives. Train the neural network to output the new u and q. This
   can be done when convenient.

Fig. 3. Wire-fitted neural network training algorithm.



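terms of $q_k(x)$ is

$$
\frac{\partial Q}{\partial q_k}
= \lim_{\epsilon \to 0^+}
\frac{\mathrm{norm}(x, u) \cdot \left(\mathrm{distance}_k(x, u) + q_k \cdot c\right) - \mathrm{wsum}(x, u) \cdot c}
     {\left[\mathrm{norm}(x, u) \cdot \mathrm{distance}_k(x, u)\right]^2}
\qquad (3)
$$

Equation (3) is inexact when $q_k = q_{\max}$. The partial derivative of Q in terms of
$u_{k,j}(x)$ is

$$
\frac{\partial Q}{\partial u_{k,j}}
= \lim_{\epsilon \to 0^+}
\frac{\left[\mathrm{wsum}(x, u) - \mathrm{norm}(x, u) \cdot q_k\right] \cdot 2 \cdot (u_{k,j} - u_j)}
     {\left[\mathrm{norm}(x, u) \cdot \mathrm{distance}_k(x, u)\right]^2}
\qquad (4)
$$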



where j selects a term of the action vector ($u_j$ is a term of the chosen action).
The summation terms in (3) and (4) have already been found in the calculation
of Q with (2).

With partial derivatives known it is possible to calculate new positions for
all the wires $u_{0 \ldots n}$ and $q_{0 \ldots n}$ by gradient descent. As a result
of this change the Q output from the wire-fitter should move closer to the new target Q.
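
The calculations in (2), (3) and (4) and the gradient step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the step size `rate`, and the default values of `c` and `eps` are assumptions, and a small fixed `eps` stands in for the limit. The wires are passed in directly, whereas in the system described here they would be the output of the neural network for the current state.

```python
def wire_fit(u, wires_u, wires_q, c=0.01, eps=1e-6):
    """Return Q(x, u) with wsum, norm and the per-wire distance terms of (2)."""
    q_max = max(wires_q)
    dist = [
        sum((a - b) ** 2 for a, b in zip(u, u_i)) + c * (q_max - q_i) + eps
        for u_i, q_i in zip(wires_u, wires_q)
    ]
    wsum = sum(q_i / d for q_i, d in zip(wires_q, dist))
    norm = sum(1.0 / d for d in dist)
    return wsum / norm, wsum, norm, dist


def wire_gradients(u, wires_u, wires_q, c=0.01, eps=1e-6):
    """Partial derivatives of Q for each q_k, as in (3), and each u_kj, as in (4)."""
    _, wsum, norm, dist = wire_fit(u, wires_u, wires_q, c, eps)
    dq = [
        (norm * (d + q_k * c) - wsum * c) / (norm * d) ** 2
        for q_k, d in zip(wires_q, dist)
    ]
    du = [
        [(wsum - norm * q_k) * 2.0 * (u_kj - u_j) / (norm * d) ** 2
         for u_kj, u_j in zip(u_k, u)]
        for u_k, q_k, d in zip(wires_u, wires_q, dist)
    ]
    return dq, du


def move_wires(u, target, wires_u, wires_q, rate=0.1, c=0.01, eps=1e-6):
    """One gradient step that moves Q(x, u) towards the target value."""
    q, _, _, _ = wire_fit(u, wires_u, wires_q, c, eps)
    dq, du = wire_gradients(u, wires_u, wires_q, c, eps)
    error = target - q
    new_q = [q_k + rate * error * g for q_k, g in zip(wires_q, dq)]
    new_u = [
        [u_kj + rate * error * g for u_kj, g in zip(u_k, g_k)]
        for u_k, g_k in zip(wires_u, du)
    ]
    return new_u, new_q
```

The moved wires returned by `move_wires` would then serve as training targets for the neural network, as in step 3 of Fig. 3. Because (3) ignores the dependence of the highest q on itself, the step for the highest valued wire is only approximate.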

6 Practical Issues

When the input to a neural network changes slowly a problem known as un- 
learning or interference can cause the network to unlearn the correct output for 
other inputs because recent experience dominates the training data [3]. We cope
with this problem by storing examples of state, action and next state transitions 
and replaying them as if they are being re-experienced. This creates a constantly 
changing input to the neural network, known as a persistent excitation. We do 
not store target outputs for the network as these would become incorrect through 
the learning process described in Sect. 5. Instead the wire-fitter is used to cal- 
culate new neural network output targets. This method makes efficient use of 
data gathered from the world without relying on extrapolation. A disadvantage 
is that if conditions change the stored data could become misleading. 
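
A minimal sketch of such a store of transitions is given below. The class name, the fixed capacity and the uniform random sampling are assumptions made for illustration. The reward is stored alongside each transition because it is needed to recompute targets, which is also an assumption since the text lists only state, action and next state. No network targets are kept, in line with the description above.

```python
import random
from collections import deque


class ReplayStore:
    """Keeps past transitions so they can be replayed as if re-experienced."""

    def __init__(self, capacity=10000, seed=None):
        # The oldest transitions are discarded once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        # Only the raw experience is stored, never a target output.
        self.buffer.append(
            (tuple(state), tuple(action), reward, tuple(next_state))
        )

    def sample(self, k):
        # Draw up to k distinct stored transitions for replay.
        k = min(k, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)
```

Each replayed transition would be passed through steps 2 and 3 of Fig. 3 again, so the targets always reflect the current state of learning. Discarding old transitions is one possible response to the disadvantage noted above, that stored data can become misleading if conditions change.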

One problem with applying Q-learning to continuous problems is that a single 
suboptimal action will not prevent a high value action from being carried out at 
the next time step. Thus the value of actions in a particular state can be very 
similar, as the value of the action in the next time step will be carried back. As 
the Q-value is only approximated for continuous states and actions it is likely 
that most of the approximation power will be used representing the values of the 
states rather than actions in states. The relative values of actions will be poorly 
represented, resulting in an unsatisfactory policy. The problem is compounded
as the time intervals between control actions get smaller. 

Advantage Learning [7] addresses this problem by emphasising the differences 
in value between the actions. In Advantage Learning the value of the optimal 
action is the same as for Q-learning, but the lesser value of non-optimal actions 
is emphasised by a scaling factor ($k \propto \Delta t$). This makes a more efficient use of
the approximation resources available. The Advantage Learning update is 

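$$
A(x, u) := (1 - \alpha)\, A(x, u)
+ \alpha \left[ \frac{1}{k} \left( R + \gamma \max_{u_{t+1}} A(x_{t+1}, u_{t+1}) \right)
+ \left( 1 - \frac{1}{k} \right) \max_{u_t} A(x_t, u_t) \right],
\qquad (5)
$$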

where A is analogous to Q in (1). The results in Sect. 7 show that Advantage 
Learning does make a difference in our learning task. 
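
As a small numerical illustration of the scaling, the update can be sketched as a single function. This is a sketch under the assumption that the update mixes the usual one-step return, weighted by 1/k, with the current state's highest advantage, weighted by (1 - 1/k); the function and argument names are hypothetical.

```python
def advantage_update(a_old, reward, max_a_next, max_a_now, alpha, gamma, k):
    """Return the updated advantage A(x, u) for one observed transition.

    max_a_next is the highest advantage in the next state and max_a_now the
    highest advantage in the current state. With k = 1 the update reduces to
    the ordinary Q-learning update.
    """
    one_step_return = reward + gamma * max_a_next
    target = one_step_return / k + (1.0 - 1.0 / k) * max_a_now
    return (1.0 - alpha) * a_old + alpha * target
```

For example, with a one-step return of 2.8 and a highest current advantage of 3.0, Q-learning (k = 1) gives a target of 2.8, while k = 0.5 gives 2.6, doubling the gap between the non-optimal action and the best action in that state.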

7 Simulation Results 

We apply our learning algorithm to a simulation task. The task involves guiding a 
submersible vehicle to a target position by firing thrusters located on either side. 
The thrusters produce continuously variable thrust ranging from full forward to 
full backward. As there are only two thrusters (left, right) but three degrees of
freedom (x, y, rotation) the submersible is non-holonomic in its planar world. 
The simulation includes momentum and friction effects in both angular and 
linear displacement. The controller must learn to slow the submersible and hold 
position as it reaches the target. Reward is the negative of the distance to the 
target (this is not a pure delayed reward problem). 

Fig. 4. Appearance of the simulator for one run. The submersible gradually learns to
control its motion to reach targets



Figure 4 shows a simulation run with hand placed targets. At the point 
marked zero the learning system does not have any knowledge of the effects of 
its actuators, the meaning of its sensors, or even the task to be achieved. After 
some initial wandering the controller gradually learns to guide the submersible 
directly toward the target and come to a near stop. 

In earlier results using Q-learning alone [5], the controller learned to direct 
the submersible to the first randomly placed target about 70% of the time. Less 
than half of the controllers could reach all targets in a series of 10. Observation of
Q-values showed that the value varied only slightly between actions, making it
difficult to learn a stable policy. In our current implementation we use Advantage
Learning (see Sect. 6) to emphasise the differences between actions. We now 
report that 100% of the controllers converge to acceptable performance. 

To test this, we placed random targets at a distance of 1 unit, in a random 
direction, from a simulated submersible robot and allowed a period of 200 time 
steps for it to approach and hold station on the target. For a series of targets, the
average distance over the time period was recorded. A random motion achieves
an average distance of 1 unit (no progress) while a hand coded controller can 
achieve 0.25. The learning algorithm reduces the average distance with time, 
eventually approaching hand coded controller performance. Recording distance 
rather than just ability to reach the target ensures that controllers which fail to 
hold station don’t receive a high rating. 

Graphs comparing 140 controllers trained with Q-learning and 140 trained 
with Advantage Learning are shown in the box-and-whisker plots in Fig. 5. The
median distance to the target is the horizontal line in the middle of the box. The 
upper and lower bounds of the box show where 25% of the data above and below 

Fig. 5. Performance of 140 learning controllers using Q-learning (left) and Advantage
Learning (right) which attempt to reach 40 targets, each placed one distance unit away



the median lie, so the box contains the middle 50% of the data. Outliers, which 
are outside 1.5 times the range between the upper and lower ends of the box 
from the median, are shown by a “+” sign. The whiskers show the range of the 
data, excluding outliers. Advantage Learning converges to good performance 
more quickly and reliably than Q-learning and with many fewer and smaller 
magnitude spurious actions. Gradual improvement is still taking place at the 
40th target. The number of outliers on the graph for Q-learning shows that the
policy continues to produce erratic behaviour in about 10% of cases.

When reward is based only on distance to the target (as in the experiment 
above) the actions are somewhat step like. To promote smooth control it is 
necessary to punish for both energy use and sudden changes in commanded 
action. Such penalties encouraged smoothness and confirmed that the system is 
capable of responding to continuous changes in state with continuous changes in 
action. A side effect of punishing for consuming energy is an improved ability to 
maintain position. 

8 Conclusion 

A practical continuous state, continuous action Q-learning system has been de- 
scribed and tested. It was found to converge quickly and reliably on a simulated 
control task. Advantage Learning was found to be an important tool in over- 
coming the problem of similarity in value between actions. 

Abstract. An approach is investigated to identify an invariant in the
team’s behavior in order to characterize the global behavior of a robot
soccer team. The existence of such an invariant confirmed by computa- 
tional evidence is reported in this paper. The invariant is being studied 
with the view of using it as a possible means for developing intelligent 
strategies in robot soccer. . . . 
mathematical foundations, artificial life.



1 Introduction 

The study of Robotics and Intelligent machines is becoming an important dis- 
cipline in Artificial Intelligence [1]. Robot Soccer is used as a test bed for many 
branches of research that have arisen in this discipline. This paper gives atten- 
tion to understanding the global behavior of teams in a Robot Soccer simulator. 
The aim is to identify any invariants in the team’s behavior that can be used as 
its global signature. We report that the approach used in the paper is confirmed 
by computational evidence and indicates the existence of such an invariant. The 
invariant is being studied with the view of using it as a possible means for de- 
veloping intelligent strategies in robot soccer. 

Usually in the game of soccer between teams of human players, enthusiasts 
of the game have an intuitive understanding of the style of play exhibited by 
a particular team. The global behavior may be designated by terms like: an 
offensive team, a defensive team, the Italian style, the Brazilian approach etc. 
Because the style of robot soccer teams has been encoded algorithmically it 
should be able to be described and understood formally. It is hoped that the 
invariant will be useful as an identifier of the robot team’s style. 

2 Robot Soccer 

The organizing of several world robot soccer competitions by the Federation of 
International Robot-soccer Association (FIRA) [2] and the RoboCup Federa- 
tion [3] has stimulated much research in the field of robot soccer. In 1996 the 

N. Foo (Ed.): AI’99, LNAI 1747, pp. 429-439, 1999. 
Micro-Robot World Cup was held in Korea [4]. During the 1997 World Cup tournament
FIRA was organized [5] and the categories NaroSot, MiroSot, RoboSot,
and KheperaSot were established. These categories differ from each other in the 
number or size of robots on the playing field. 

The RoboCup was initiated by a group of researchers into Artificial Intelli- 
gence in Japan. This competition is split into leagues, the Small Size League, 
Middle Size League, and the Simulator League. The Simulator League is a test 
bed for robot team strategies played in simulation on the official RoboCup Sim- 
ulator [6]. In 1998 the World Cup for both competitions was held in Paris and 
in the year 2000 to coincide with the Olympic games in Sydney, the World Cup 
for both competitions will be held in Australia. The FIRA competition will be
hosted by Central Queensland University in Rockhampton [7] and RoboCup will 
be hosted by RMIT University in Melbourne [8]. 

2.1 The Game of Robot Soccer 

In both Robot Soccer competitions the game is played on a soccer field that is 
scaled down to about the size of a ping-pong table [2,3] (9 tables for RoboCup 
middle sized league MSL [3]). Depending on the category, there are between one 
and five robots on each team. The robots attempt to move a golf ball (200mm 
ball for MSL) into the opponent’s goal. In general the RoboCup competition has 
larger robots that move slower than in the FIRA competition. FIRA is seeking 
to have micro and nano technology built into its robots. 

Usually a central computer is used to transmit instructions to each robot 
based on information given by a global vision system except in MSL where all 
systems are on board the robot. The goal of both competitions is to have fully 
autonomous robots with on-board vision and control systems. 

2.2 Robot Soccer Simulators 

Autonomous robots must perform many complex computations such as cooperation
with team members [1,9,10], coordination of movements [11], interpretation
of visual signals [12], collision avoidance [13], machine learning [14] and calcula- 
tion of trajectories [15]. These computational requirements are currently far too 
great to physically fit inside a robot. For this reason simulators are being used to 
test algorithms that improve the strategies and global behavior of robot soccer 
teams. Consequently understanding global behavior becomes a very important 
problem. 

Some of these simulators can be found on the web and freely downloaded 
(see Table 1). 

2.3 ASCII Soccer Simulator 

The ASCII Soccer simulator is based on a text screen with a field 78 characters 
long and 21 lines wide. A team consists of 4 “>” characters playing against 4 




On Understanding Global Behavior of Teams in Robot Soccer Simulators 431 



Table 1. Robot Soccer Simulators and the internet addresses where they can be
downloaded.

Simulator              Internet address
ASCII Soccer           http://www.cc.gatech.edu/grads/b/Tucker.Balch/soccer/
Java Soccer            http://www.cc.gatech.edu/grads/b/Tucker.Balch/JavaBots/EDU/gatech/cc/is/docs/index.html
RoboCup Soccer Server  http://ci.etl.go.jp/~noda/soccer/server/DownLoad.html



“<” characters with an “o” as the ball. The goals are the left hand and right 
hand sides of the field. The ASCII Soccer simulator was used in this research due 
to its simplicity and the ease of extracting the data of the successive positions 
of play throughout the game. 

Thirteen different teams can be downloaded with the ASCII Soccer simulator. 
The strategies that are encoded in the teams vary from random players who move 
in any direction, to position-based strategies and continuous feedback learning 
and control. 



3 An Approach to Understand Team Behavior 

An approach used in this paper is based on the theory of Random Matrices [16] 
and Structural Complexity [17,18]. It may be seen as analogous to the spectral
analysis procedure that physicists use to identify substances by how the
substance absorbs and emits photons [19]. This approach is employed to identify 
an invariant in the team’s behavior that can be used as a global signature. The 
invariant is being studied with the view of using this characterization of global 
behavior as a possible approach to developing intelligent strategies in robot soc- 
cer. 

The core of this approach is based on eigenvalue statistics of matrices related 
to the dynamics of robots. The movements of the team of robots throughout 
the game are captured by a sequence of matrices. The approach is confirmed 
by computational evidence given in the paper that indicates the existence of 
an invariant. Computational results show that the statistics derived from the 
eigenvalues of these matrices are consistent over a number of games against 
the same opponent and also against different opponents. This encourages us to 
consider them as a signature of the team’s global behavior in the same way as a 
spectrum identifies the behavior of atoms in a molecule. 



3.1 Constructing Matrices that Describe Team Movements 

In ASCII soccer the playing field is naturally divided into a grid 78 by 21. In order 
to speed up computations by decreasing the size of the matrix the resolution was 
changed to a grid 39 by 11. Starting from the bottom left-hand corner of the grid 
and moving from left to right and then to the next line, each cell was numbered
from 1 to 429. 
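
The numbering can be sketched as a small function. The paper does not state how character positions were mapped onto the coarser grid, so the zero-based coordinates, the origin at the bottom left and the use of integer division by two are all assumptions made for this sketch; the function name is hypothetical.

```python
def cell_number(col, row, cols=78, rows=21, factor=2):
    """Map a character position to a coarse cell number in 1..429.

    col counts from 0 at the left edge and row from 0 at the bottom edge
    (both conventions are assumptions made for this sketch).
    """
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError("position is outside the playing field")
    coarse_cols = (cols + factor - 1) // factor  # 78 columns become 39
    coarse_col = col // factor
    coarse_row = row // factor                   # 21 rows become 11
    # Number cells left to right, then upwards, starting from 1.
    return coarse_row * coarse_cols + coarse_col + 1
```

Under these assumptions the bottom left character maps to cell 1 and the top right character to cell 429.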

At each instant of time each robot player will be in a cell, denoted $s_i$. A
“frame of action” represents the cell positions for the whole team at a given 
instant of time. Corresponding to the change in resolution, only every second 
frame of action was considered. 

The movement from one frame of action to the next is captured by a 429 by
429 symmetric matrix $A = (a_{ij})$, $i, j = 1, \ldots, 429$, where $a_{ij} = a_{ji}$
equals 1 if a robot moves from cell $s_i$ to $s_j$. The construction of matrix A is
illustrated by the following example. The data in Table 2 shows the cell positions of
the team players for frame 1 and frame 2. In the matrix A of zeros the following
elements are changed to ones: (15, 54), (54, 15), (132, 172), (172, 132), (249, 289),
(289, 249), (366, 367), (367, 366).



Table 2. Data for illustration of matrix construction: the cell positions of the
four team players (1 to 4) in frame 1 and frame 2.









This method of encoding has been used in order to preserve symmetry in the 
matrix and as a result the following situations arise. When a robot remains in 
the same cell it will contribute only one element which is on the diagonal of the 
matrix. In the case where two robots swap positions from one frame to the next, 
the matrix A would appear to “lose” a robot, i.e. only two ones would appear in 
A whereas normally four ones would encode the movements of the two robots. 
It can be shown that generality is not affected by these simplifications and we 
plan to take this into account in our future research. 
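
The encoding, including the two special cases just described, can be sketched as follows. The function name is hypothetical, and the example frames are inferred from the matrix elements listed in the text (which robot occupies which cell, and which frame comes first, are assumptions).

```python
def movement_matrix(frame_a, frame_b, size=429):
    """Build the symmetric 0/1 matrix for one pair of consecutive frames.

    frame_a and frame_b list the cell number (1..size) of each robot, in the
    same robot order, before and after the move.
    """
    a = [[0] * size for _ in range(size)]
    for s_i, s_j in zip(frame_a, frame_b):
        a[s_i - 1][s_j - 1] = 1
        a[s_j - 1][s_i - 1] = 1  # mirror entry keeps the matrix symmetric
    return a


# Example frames consistent with the elements listed in the text.
frame_1 = [15, 132, 249, 366]
frame_2 = [54, 172, 289, 367]
A = movement_matrix(frame_1, frame_2)
ones_in_A = sum(map(sum, A))  # eight ones, two per robot
```

A robot that stays in its cell contributes a single diagonal one, and two robots that swap cells contribute only two ones between them, since both moves set the same pair of mirror entries.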

3.2 Distribution of Eigenvalues 

Statistics of spacings between eigenvalues have been found to be very useful in
categorizing global behavior in the field of quantum chaology [20]. In particular
it has been found that the probability distribution of these spacings reveals an
invariant that is useful in classifying phenomena on the boundary between the
classical and quantum realms.

The matrices $A_1, \ldots, A_t$, which encode the movements of the team of robots
from one frame of action to the next over t + 1 frames, are considered individually.
The eigenvalues calculated for each matrix are arranged in ascending order
$\lambda_1, \ldots, \lambda_n$. All but one of the zero eigenvalues are ignored by the
process of considering the differences (spacings) d between sequential eigenvalues
given by: $d = \lambda_i - \lambda_{i+1}$ if $\lambda_i$ and $\lambda_{i+1}$ are both
not zero. These spacings are stored for all $A_1, \ldots, A_t$. This data is then
displayed as a probability histogram.
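
The procedure can be sketched with NumPy as follows. The function names and the tolerance `tol` used to decide when an eigenvalue counts as zero are assumptions, as is the way the histogram bins are laid out. The sketch follows the formula as written, keeping a spacing only when both neighbouring eigenvalues are non-zero; with ascending order this makes every spacing non-positive.

```python
import numpy as np


def eigenvalue_spacings(a, tol=1e-9):
    """Spacings between sequential non-zero eigenvalues of a symmetric matrix."""
    eig = np.sort(np.linalg.eigvalsh(a))  # real eigenvalues, ascending
    left, right = eig[:-1], eig[1:]
    keep = (np.abs(left) > tol) & (np.abs(right) > tol)
    return (left - right)[keep]


def spacing_probabilities(matrices, width):
    """Pool the spacings of all matrices into a probability histogram."""
    d = np.concatenate([eigenvalue_spacings(a) for a in matrices])
    lo = np.floor(d.min() / width) * width
    n_bins = int(np.floor((d.max() - lo) / width)) + 1
    counts, edges = np.histogram(
        d, bins=n_bins, range=(lo, lo + n_bins * width)
    )
    return counts / counts.sum(), edges
```

Calling `spacing_probabilities` on the list of matrices for a whole game with `width=0.2` would correspond to the lower resolution histograms reported in the results, under the assumptions stated above.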

3.3 Results

Figure 1 shows the graphs of these statistics for 3 games played by the Soccer 
Spaniels team against the Dynamic Rollers. Each bar represents the probability 
of having a spacing between eigenvalues at its position on the x axis. Figure 2 
shows the statistics for the Dynamic Rollers team for the 3 games. In both these 
figures the interval width for the sample is 0.0033... , which gives a very high 
resolution. Figure 3 and Figure 4 display the data collected over 5 games between 
the same two teams using a much lower resolution. Here the interval width is 
0.2. 

It is to be noted that the algorithms for the two teams use different approaches
and are programmed by different authors. Quite surprisingly, it is observed
that the statistics for each team are very similar for each of the five games
and appear, as our results show, to be a signature of the global behavior for each
team. That the games were completely different can be seen by the different 
game results as shown in Table 3. 



Table 3. Scores for 5 games between Soccer Spaniels and Dynamic Rollers
simulated robot soccer teams.

Team/Game         1    2    3    4    5
Soccer Spaniels   7    2    10   ?    10
Dynamic Rollers   10   ?    ?    10   9



The Soccer Spaniels and the Dynamic Rollers teams also were pitted against 
other teams that had been downloaded with the ASCII Soccer simulator and 
remarkably the same signatures were consistently found for each team. The other 
teams also were seen to have their own individual signatures. It appears that 
the signature is an invariant of the team’s global behavior and may be a formal 
way of describing the team’s style of play. These similarities are established by 
the usual statistical methods. 



4 Discussion 

An approach for identifying a global invariant of a robot soccer team is confirmed 
by computational results that indicate the existence of such an invariant as 
presented in this paper. This encourages us to anticipate that this invariant could 
be viewed as a global signature of the team’s style. This is a first step in the 
search to identify team style of play in a quantitative rather than qualitative way. 
It is a mathematical approach that formally describes a team’s global behavior. 
Future research will focus on finding the link between a team’s signature and its 
behavior. 

Fig. 1. Probability histograms with an interval width of 0.0033... of spacings
between eigenvalues for matrices encoding the movement of the Soccer Spaniels
Simulated Robot Soccer Team for 3 games.

Fig. 2. Probability histograms with an interval width of 0.0033... of spacings
between eigenvalues for matrices encoding the movement of the Dynamic Rollers
Simulated Robot Soccer Team for 3 games.

Fig. 3. Probability histograms with an interval width of 0.2 of spacings between
eigenvalues for matrices encoding the movement of the Soccer Spaniels Simulated
Robot Soccer Team for 5 games.

Fig. 4. Probability histograms with an interval width of 0.2 of spacings between
eigenvalues for matrices encoding the movement of the Dynamic Rollers Simulated
Robot Soccer Team for 5 games.




438 Victor Korotkich and Noel Patson 



All the consequences and ramifications of this intriguing discovery are not 
the subject of this paper but it is believed that finding the relationship between 
the signature and a team’s dynamics will be very useful in developing strategies 
for the home team. This is important because understanding of the opponent 
team’s dynamics will enable a team to counteract the opponent’s strategies and 
hopefully estimate their future positions. It may also identify weaknesses in the 
home team’s coordination so that fine-tuning may be done. It is believed that 
the use of this signature may help the team win. 

Abstract. It is envisaged that computers of the future will have smart
interfaces such as speech and vision, which will facilitate natural and
easy human-machine interaction. Gestures of the face and hands could
become a natural way to control the operations of a computer or a
machine, such as a robot. In this paper, we present a vision-based interface
that in real-time tracks a person’s facial features and the gaze point of the
eyes. The system can robustly track facial features, can detect tracking
failures and has an automatic mechanism for error recovery. The system
is insensitive to lighting changes and occlusions or distortion of the
facial features. The system is user independent and can automatically
calibrate for each different user. An application using this technology
for driver fatigue detection and the evaluation of ergonomic design of
motor vehicles has been developed. Our human-machine interface has
an enormous potential in other applications that allow the control of
machines and processes, and measure human performance. For example,
product possibilities exist for assisting the disabled and in video game
entertainment.



1 Introduction 

Throughout the era of computer vision research, much effort has been undertaken
to improve the interface between humans and machines [1, 2]. An important
area of this research has been the recognition of gestures of the head and body, 
as well as expressions of the face. Gestures are regarded as the most natural 
forms of human expression. Particularly for disabled and totally inexperienced 
computer users, a gesture interface would open the door to many applications 
ranging from control of machines to “helping hands” for the elderly and disabled. 
A smart visual human-machine interface should recognise facial gestures such as
“yes” or “no”, as well as being able to determine the user’s gaze point, i.e. the
direction in which the person is looking and the focus of attention. The ability to
estimate a person’s gaze point is very important. For example, a robot assisting
the disabled may need to pick up items that attract the user’s gaze. The crucial 
aspect of such systems is the capability to process data in real-time. However, at 
present many computer vision systems still suffer from a lack of computational 
speed. In our research we describe a vision system that is capable of real-time 
feature tracking. 

The goal of our research is to build a visual human-machine interface based
on a real-time stereo camera system, that works in natural scenes to track the 
human face as well as estimate the user’s gaze direction. It is our belief that in 
the near future, vision-based interfaces will revolutionise the way people work 
with machines, and will open up a tremendous range of new applications. 

Innovative applications of the technology for the automotive industry have
been a primary focus of our research. Driver Behaviour Analysis and Driver
Safety Systems are the prime targets. Car manufacturers also need to analyse the 
ergonomic efficiencies of their motor vehicle designs. Where drivers look and how 
they behave in vehicles equipped with various configurations of instrumentation 
is of great importance. Our system completely automates the data capture and 
replay phase of driver behaviour analysis. We determine the instrument being 
observed using an internal computer model of the vehicle and the driver’s gaze 
point as a reference. The duration and timing of the driver’s gaze are automatically
measured and logged. The relevance of our system to driver safety is
apparent: it can detect whether the driver is not looking at the road or has fallen
asleep. We have built, tested and successfully demonstrated an in-car prototype.



2 Visual Interfaces 

2.1 Face and Gaze Detection 

There are several types of commercial products in existence to detect head posi- 
tion and orientation, using magnetic sensors and link mechanisms. There are also 
several companies supporting products that perform eye gaze tracking. These 
products are generally highly accurate and reliable. However, all these prod- 
ucts require either expensive hardware and/or artificial environments (helmets, 
infrared lighting, markings on the face, etc.). The restricted motion and discomfort
to the user caused by such equipment make it difficult to measure the natural and
uninhibited behavior of people.

To solve this problem, there have been many research results reported that 
are related to the visual detection of head pose [3, 4, 5, 6]. Recent advances in 
computer hardware have allowed researchers to develop real-time face tracking 
systems. However, all of the previously reported systems are based on monocular
vision. Recovering the 3D pose from a monocular image stream is regarded as 
a difficult problem in general. High accuracy as well as robustness are particu- 
larly hard to achieve. Most reported systems do not compute the full 3D 6DOF 
posture of heads. Researchers have developed monocular systems to detect both
the head pose and gaze point simultaneously [7, 8]. However, these systems do
not accurately determine the 3D vector of the gaze direction.

In order to construct a system which observes a person without causing any 
discomfort, the system should satisfy the following requirements: non-contact, 
passive, real-time, robust to occlusions and lighting change, compact, accurate, 
and capable of detecting head posture and gaze direction simultaneously.

Fig. 1. System configuration of the human-machine interface.



Our system simultaneously satisfies all the listed requirements by utilizing 
the following techniques: stereo vision using field multiplexing, image processing 
using normalized correlation and 3D model fitting using virtual springs. 

3 Real-Time Vision Hardware 

Figure 1 illustrates the hardware setup of our real-time stereo face tracking
system. It has an NTSC camera pair (SONY EVI-370DG x 2) to capture the
person’s face. The output video signals from the cameras are multiplexed into
one video signal using the “field multiplexing technique” [9]. The multiplexed
video stream is then fed into a vision processing board (Hitachi IP5000), where
the position and the orientation of the face are determined. The face tracking
results are visualized on an SGI O2 graphics workstation.



3.1 Hitachi IP5000 Image Processor 

The Hitachi IP5000 PCI half-sized image processing board is used in this re- 
search. The card is equipped with 40 frame memories of 512 x 512 pixels. It 
provides in hardware a wide variety of fast image processing functions such as 
binarization, convolution, filtering, labeling, histogram calculation, color extrac- 
tion and normalized correlation. The frequency of these operations is 73.5MHz, 
which means the card can apply a single basic function (such as binarization) to 
a single image in 3.6ms. 



3.2 Field Multiplexing Device 

Field multiplexing is a method for generating an analog multiplexed video
stream from two video streams. A diagram of the device is shown in Figure 2.
The device feeds two synchronized video streams into a video switching
IC. The video switcher selects one signal and uses it as the odd field of the video
output; the other signal becomes the even field of the video output. Since the
frequency of the switching is only 60Hz, the multiplexer can be easily and cheaply
implemented using only consumer electronic parts. A photo of the device is also
shown in Figure 2; the size of the device is less than 5cm square.

The advantage of multiplexing video signals in the analog phase is that such
an approach can be applied to any vision system. Single video stream processing
is transformed into stereo vision processing. Since the multiplexed image is stored
in a single video frame memory, stereo image processing can be performed within
the memory. This means there is no overhead cost for transferring images, which
is inevitable in a stereo vision system with two image processing boards. Thus a
system with a field multiplexing device can have a higher performance than a
system with two boards.

A minor weak point of field multiplexing is that the image looks strange
to human eyes if the signal is displayed directly on a TV monitor, because the two
camera images alternate from one line to the next. However, this does not make
image processing any harder, since a normal image can be easily obtained by
subsampling the multiplexed image in the vertical direction.
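
The vertical subsampling step can be sketched as follows. This is only an illustration, assuming the multiplexed frame is held as a list of rows in which rows 0, 2, 4, ... come from one camera and rows 1, 3, 5, ... from the other; the function name `split_fields` and the mapping of fields to cameras are assumptions, not details of the IP5000 implementation.

```python
def split_fields(frame):
    """Split a field-multiplexed frame into two half-height images.

    frame: list of rows, each row a list of pixel values.
    Assumption: even-numbered rows belong to one camera and
    odd-numbered rows to the other.
    """
    first_field = frame[0::2]   # rows 0, 2, 4, ...
    second_field = frame[1::2]  # rows 1, 3, 5, ...
    return first_field, second_field


# Tiny 4x3 example: "A" rows from one camera, "B" rows from the other.
frame = [
    ["A0", "A0", "A0"],
    ["B0", "B0", "B0"],
    ["A1", "A1", "A1"],
    ["B1", "B1", "B1"],
]
image_a, image_b = split_fields(frame)
```

Both images stay in the same frame memory layout, so a stereo pair is obtained with nothing more than a row stride of two.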



Fig. 2. Field Multiplexing Device.



4 Stereo Face Tracking 

4.1 3D Facial Model 

The 3D facial model used in our stereo face tracking is composed of three com- 
ponents: 

— template images of the facial features, 

— 3D coordinates of the facial features, 

— an image of the entire face. 

The facial features are defined as both corners of each eye and of the
mouth. Thus there are six feature images and coordinates in a facial model, an 
example of which is shown in Figure 3. The facial model also has an image of 
the whole face stored in low resolution. This image is used to search for a face 
at the system initialisation stage and in cases when the system feature tracking 
fails. 

The facial model can be acquired either “automatically” or “manually”. In
automatic acquisition mode, the eyes and mouth are detected by first finding 
skin colored regions in the image and then binarizing the intensity information 
contained in the skin colored facial region. Small image patches at both ends 
of the extracted eyes and mouth are memorized as feature templates, and the 
3D coordinates of the features are calculated based on stereo matching. In the 
manual acquisition mode, the image patches of the features are selected by simply 
clicking with a mouse over positions of interest in the image. Stereo matching is 
then performed to calculate the 3D coordinates. 




Fig. 3. Upper: Extracted tracking features from stereo images. Lower: 3D facial
model.

4.2 Stereo Tracking Algorithm

Before face tracking begins, the error recovery procedure is executed to determine 
the approximate position of the face in the live video stream using the whole 
face image. 

Feature tracking and stereo matching for each feature are carried out to 
determine the 3D positions of each feature. A 3D facial model is fitted to the 3D 
measurements, and the 3D position and orientation of the face is estimated in 
terms of six parameters. Then the 3D coordinates for each feature are adjusted to 
maintain the consistency of the rigid body facial model. Finally, the 3D feature 
coordinates are projected back onto the 2D image plane in order to update the 
search area for feature tracking by the vision processor in the next frame. 

At the end of each tracking process cycle, the overall reliability of the face 
tracking is determined using the correlation values of feature tracking and stereo 
matching. If the reliability is higher than a preset threshold, the system returns 
to the beginning of the tracking process again. Otherwise the system decides it 
has lost the face and jumps back to the error recovery phase. 
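
The control flow between tracking and error recovery can be sketched as below. This is a minimal sketch under stated assumptions: the per-feature reliability is taken as the product of the two correlation values (as described later for model fitting), the overall reliability is assumed to be their mean, and the threshold of 0.5 and all function names are hypothetical. The recovery search is assumed to succeed within one frame.

```python
def feature_reliability(corr_right, corr_left):
    """Reliability of one feature: product of its two correlation values."""
    return corr_right * corr_left


def overall_reliability(correlations):
    """Assumed aggregate: mean of the per-feature reliabilities."""
    values = [feature_reliability(r, l) for r, l in correlations]
    return sum(values) / len(values)


def run_tracker(frames, threshold=0.5):
    """Return the mode used for each frame: 'recover' or 'track'.

    frames: one list of (corr_right, corr_left) pairs per video frame.
    The tracker starts in recovery mode because the face position is unknown.
    """
    modes = []
    face_located = False
    for correlations in frames:
        if not face_located:
            modes.append("recover")  # whole-face search, assumed to succeed
            face_located = True
            continue
        modes.append("track")
        if not overall_reliability(correlations) > threshold:
            face_located = False     # face lost: next frame runs recovery
    return modes


good = [(0.9, 0.9)] * 6   # six well-matched features
bad = [(0.3, 0.4)] * 6    # six poorly matched features
modes = run_tracker([good, good, bad, good, good])
```

In this run the third frame falls below the threshold, so the fourth frame is spent on recovery before tracking resumes.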



3D Feature Tracking In the 3D feature tracking stage, it is assumed that 
each feature has only a small displacement between the current frame and the 
previous one. The 2D positions of features in the previous frame are used to 
determine the search area in the current frame. The feature images stored with 
the 3D facial model are used as templates. Images from the right camera are 
searched for features. The 2D features that have been found in the right image 
are used as templates for searching for matching images from the left camera. By 
stereo matching, the 3D coordinates of each feature are acquired. The whole
tracking process (i.e. feature tracking + stereo matching for six features)
takes about 10ms on the IP5000.
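
A software sketch of the local template search is given below. It is illustrative only: the IP5000 performs normalized correlation in hardware and its exact formula is not given in the text, so the plain (non-zero-mean) normalized correlation used here, which lies in [0, 1] for non-negative pixel values, is an assumption, as are the function names and the search radius.

```python
import math


def normalized_correlation(patch, template):
    """Normalized correlation of two equally sized grey-level patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def search_feature(image, template, prev_xy, radius):
    """Search a small window around the previous position for the template.

    Returns ((x, y), score), where (x, y) is the top-left corner of the
    best-matching patch inside the window.
    """
    th, tw = len(template), len(template[0])
    px, py = prev_xy
    best_xy, best_score = None, -1.0
    for y in range(max(0, py - radius), min(len(image) - th, py + radius) + 1):
        for x in range(max(0, px - radius), min(len(image[0]) - tw, px + radius) + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = normalized_correlation(patch, template)
            if score > best_score:
                best_xy, best_score = (x, y), score
    return best_xy, best_score


# 6x6 test image with the template pattern placed at column 3, row 2.
template = [[9, 1], [1, 9]]
image = [[1] * 6 for _ in range(6)]
image[2][3], image[2][4] = 9, 1
image[3][3], image[3][4] = 1, 9
position, score = search_feature(image, template, prev_xy=(2, 2), radius=2)
```

For stereo matching, the patch found in the right image would be passed as the template for a second call on the left image.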




3D Model Fitting Figure 4 illustrates the coordinate system used to represent
the position and the orientation of the face. The parameters (φ, θ, ψ) represent
the orientation of the face, and (x, y, z) represent the position of the face center
relative to the origin of the camera axis.

Fig. 5. 3D model fitting algorithm.

The diagrams in Figure 5 describe the model fitting scheme used in our sys- 
tem. In the actual implementation six features are used for tracking, however 
only three points are illustrated in the diagrams for simplicity. The basic idea of 
the model fitting is to iteratively move the model closer to the system measurements
while considering the reliability results of feature tracking. As mentioned
before, the face is assumed to have a small motion between the frames. This
means there can be only small displacements in terms of the position and the
orientation, which are described as (Δx, Δy, Δz, Δφ, Δθ, Δψ) in Figure 5(1).

The position and the orientation determined in the previous frame (at time t) 
are used to rotate and translate the data set of measurements from the vision 
system to the same coordinate space as the model, as shown in Figure 5(2). After 
the rotation and translation, the data set of measurements have a small disparity 
from the model due to the motion which has occurred during the interval Δt. Fine
fitting of the model is performed next. To realize a robust fitting of the model, 
it is essential to take the reliability values of the individual measurements into 
account. The least squares method is usually adopted for such purposes. In our 
system, a similar fitting approach based on virtual springs is used. The result of 
3D tracking yields two correlation values (for the left and right images) which 
are between 0 and 1 for each feature. If a template and another matching region 
have exactly the same pattern, then the resulting image correlation value is 1. A 
value of 0 represents correlation where all pixels in the two correlation images are 
completely different. The product of the two correlation values for each feature 
can be regarded as a reliability value. The reliability values are used as the 
stiffness parameters of the springs that link each feature in the model. The
spring-based model fitting is shown in Figure 5(3). The model is iteratively rotated
and translated in order to reduce the elastic energy of the springs. Using the tracking
reliability as a spring constant makes the results of model fitting insensitive to
partial matching failures, and ensures robust face tracking. The iterative
model fitting takes less than 2ms on a Pentium II 450MHz
processor.
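
The idea of reliability-weighted springs can be illustrated with a simplified 2D version of the fit. This is a sketch, not the system's 3D six-parameter algorithm: it assumes a planar pose (θ, tx, ty), an elastic energy E = ½ Σ kᵢ |R(θ)mᵢ + t − pᵢ|², and an update scheme that alternates the energy-minimising translation and rotation; the function name and the test data are hypothetical.

```python
import math


def fit_pose_2d(model, measured, stiffness, theta=0.0, iterations=50):
    """Lower the spring energy by alternating translation and rotation steps.

    model, measured: lists of (x, y) points; stiffness: one k_i per point.
    Returns (theta, tx, ty) such that R(theta) * m_i + t approximates p_i.
    """
    k_sum = sum(stiffness)
    if k_sum == 0:
        raise ValueError("all springs have zero stiffness")
    triples = list(zip(stiffness, model, measured))
    tx = ty = 0.0
    for _ in range(iterations):
        c, s = math.cos(theta), math.sin(theta)
        # Translation step: weighted mean of p_i - R(theta) m_i.
        tx = sum(k * (p[0] - (c * m[0] - s * m[1])) for k, m, p in triples) / k_sum
        ty = sum(k * (p[1] - (s * m[0] + c * m[1])) for k, m, p in triples) / k_sum
        # Rotation step: angle that best aligns m_i with p_i - t.
        a = sum(k * (m[0] * (p[0] - tx) + m[1] * (p[1] - ty)) for k, m, p in triples)
        b = sum(k * (m[0] * (p[1] - ty) - m[1] * (p[0] - tx)) for k, m, p in triples)
        theta = math.atan2(b, a)
    return theta, tx, ty


# Six model features; the last measurement is a simulated matching failure.
model = [(-45, 30), (-15, 30), (15, 30), (45, 30), (-25, -40), (25, -40)]
true_theta, true_t = 0.1, (5.0, -3.0)
c, s = math.cos(true_theta), math.sin(true_theta)
measured = [(c * x - s * y + true_t[0], s * x + c * y + true_t[1]) for x, y in model]
measured[5] = (measured[5][0] + 40.0, measured[5][1] + 40.0)
stiffness = [0.9, 0.9, 0.8, 0.9, 0.8, 0.0]  # failed feature gets zero stiffness
theta, tx, ty = fit_pose_2d(model, measured, stiffness)
```

Because the failed feature has zero stiffness, its spring stores no energy and the recovered pose is unaffected by the bad measurement.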



4.3 Error Recovery 

The tracking method described above searches only a small area of the image,
which enables real-time processing and continuous, stable tracking results.
However, once the system fails to track the face, it is hard for the system to
recover using only local template matching, so a complementary
method for finding the face in the image is necessary as an error recovery
function. This process is also used at the beginning of the tracking when the
position of the face is unknown.

The whole face image shown in Figure 3 is used in this process. In order to 
reduce the processing time, the template is memorized in low resolution. The 
live video stream is also reduced in resolution. The template is first searched in 
the right image, and then the matched image is searched in the left image. As a 
result, the rough 3D position of the face is determined and is then used as the 
initial state of the face for the face tracking. This searching process takes about 
100ms. 

5 Implementation Results

5.1 Face Tracking 

Some snapshots of the results of the real-time face tracking system are shown 
in Figure 6. Images (1) and (2) in Figure 6a show results when the face rotates, 
while (3) and (4) show results when the face moves closer to and further from 
the camera. The whole tracking process takes about 30ms, which is within an
NTSC video frame period. The accuracy of the tracking is approximately ±1mm
in translation and ±1° in rotation.

The snapshots in Figure 6b show the results of tracking when there is some 
deformation of the facial features and partial occlusions of the face by a hand. 
The results indicate our tracking system works quite robustly in such situations 
owing to our model fitting method. By utilizing the normalized correlation func- 
tion on the IP5000, the tracking system is tolerant of the fluctuation in lighting. 




Fig. 6. Face tracking results. 



5.2 Visualization 

The results of the tracking are visualized using an SGI O2 graphics workstation.
Figure 7a illustrates examples of the tracking results and the corresponding 
visualization. The 3D model used in the visualization consists of the rigid surface 
of the face and two eyeballs. The face has six DOF for position and orientation, 
and the eyeballs have two DOF each. The external border of the iris of
each eye is detected using the Circular Hough Transform, and is then used to
position the eyes of the mannequin head. The visualization process is performed
online during the tracking, and therefore the mannequin head can mimic the 
person’s head and eye motions in real-time. 

Fig. 7. Visualization of Face Tracking and Gaze Estimation.



5.3 Gaze Detection 

Our system outputs the position and posture of the face, as well as the position 
of the pupils. Pupil detection is done using the Circular Hough Transform. Gaze 
point estimation is done by fusing the 3D face position and posture with the 
2D positions of the pupils. The eyeball is modelled as a sphere, and using the 
corners of the eye the center of the eyeball can be determined. The eyeball center 
is connected to the detected pupil’s center to form a gaze line for each eye. The 
gaze lines are then projected into 3D space, and the intersection point becomes 
the estimated gaze point. Figure 7b shows a stereo image pair of a face, and 
the corresponding mannequin model shows the estimated face pose and gaze 
direction. Pupil detection takes about 2ms. The combined face tracking and
gaze detection system runs at 20Hz.
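
The final step, combining the two gaze lines into a gaze point, can be sketched as follows. Two measured 3D lines rarely intersect exactly, so this sketch assumes the gaze point is taken as the midpoint of the shortest segment between them; that choice, the function names and the example coordinates (in millimetres) are assumptions rather than details given in the paper.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gaze_point(center1, dir1, center2, dir2):
    """Midpoint of the shortest segment between two gaze lines.

    Each line passes through an eyeball center with a direction given by
    (pupil center - eyeball center). Returns None for near-parallel lines.
    """
    w = sub(center1, center2)
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if denom <= 1e-12 * a * c:
        return None
    s = (b * e - c * d) / denom   # parameter along the first gaze line
    t = (a * e - b * d) / denom   # parameter along the second gaze line
    p1 = tuple(p + s * v for p, v in zip(center1, dir1))
    p2 = tuple(p + t * v for p, v in zip(center2, dir2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Two eyes 60mm apart, both looking at a target 600mm in front of the face.
left_eye, right_eye = (-30.0, 0.0, 0.0), (30.0, 0.0, 0.0)
target = (100.0, 50.0, 600.0)
estimate = gaze_point(left_eye, sub(target, left_eye),
                      right_eye, sub(target, right_eye))
```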

6 Applications 

Our system has been developed for use as a general vision-based human-machine 
interface. We believe our interface can be applied to an enormous range of real 
world applications. For example our system could be used as a measurement tool 
in psychological experiments to determine where a human subject is looking. A 
commercial application of gaze measurement lies in advertising. Marketing firms 
could evaluate the effectiveness of television commercials by monitoring gaze di- 
rection while subjects are watching the TV. Another possible area of application 
lies in the entertainment industry. Video games could be controlled simply by 
a user’s gaze direction. Likewise disabled people could steer wheelchairs using 
gaze, or command robots to pick up items merely by looking at the object. 

An important application area lies in evaluating the ergonomic design effectiveness
of products and machines that people use. For example, the operator
control panel in a nuclear plant should be designed and laid out in a manner
that does not induce any confusion that causes the operator to press the wrong
button. A non-intrusive means of observing the operator working is needed. In
the case of a poor ergonomic design the operator’s gaze will wander unnecessarily.
In this application domain we have developed a system that measures the
ergonomic design efficiency of motor vehicles in driving tasks. We have built
an in-car version of our system. Figure 8 shows the in-car installation. The figure
shows the stereo cameras mounted in the dash, a close up of the cameras, the
driver, the display monitor mounted in the rear seat, and the computer equipment
in the rear compartment. We can measure the gaze direction and duration
while a person is driving. For example, the system can log how often the driver
glances at the speedometer and side mirrors.

Fig. 8. Driver Performance Measurement System.

The system has been extensively tested with a variety of drivers in a wide 
range of driving conditions. It is able to work reliably over 85% of the time; on
the occasions when feature tracking does fail, our system recovers automatically.
The onboard computer logs the user’s gaze detection data. This feature allows 
design engineers to analyse the data off-line. The off-line analysis also allows 
users to visualise the gaze detection data using a CAD model of the motor 
vehicle. Figure 9 shows an example of gaze point visualisation inside a 3D CAD 
model of a motor vehicle. 

An obvious extension of our work is to apply the system to Driver-Safety in 
motor vehicles, so that an alarm sounds if a driver falls asleep. Currently, our 
system can detect if a person has fallen asleep (by measuring eye closure) and 
can also determine if a driver has entered a hypnotic state (by measuring the 
blink rate of the eyes). Also, a person may be awake and not looking at the road 
ahead, and should be warned. 



Visual Human-Machine Interaction 451




Fig. 9. Driver Performance Visualisation. 



7 Conclusions 

In this paper, we presented a vision-based human-machine interface. The interface
can reliably, accurately and robustly measure the 3D position
and orientation of the face, as well as the user’s gaze direction, in real time. The
system is non-intrusive and passive, thereby making it a natural interface. Presently,
this interface is the most advanced of its kind. In the course of this research project,
we solved the longstanding problems of processing video images fast
enough to achieve reliable results, coping with different users and varying lighting
conditions, and developing automatic error recovery mechanisms.

We believe that in the near future, vision-based interfaces will revolutionise 
the way people work with machines, and will open up a huge range of new appli- 
cations. We have demonstrated the usefulness of the interface in two applications 
for motor vehicles. The interface could also be applied to measuring human per- 
formance in psychological experiments, ergonomic design, as well as products for 
the disabled and in video games for the entertainment industry. 

Abstract. In previous work, solutions for the nesting problem were produced
using the no fit polygon (NFP), a new evaluation method and three evolutionary
algorithms (simulated annealing (SA), tabu search (TS) and genetic algorithms
(GA)). Tabu search was shown to produce the best quality solutions for
two problems. This paper extends that work. A relatively new type of
search algorithm (the ant algorithm) is developed and its results
are compared against SA, TS and GA. We discuss the ideas behind
ant algorithms and describe how they have been implemented with regard to
the nesting problem. The evaluation method used is described, as is the NFP.
Computational results are given.

Keywords. Genetic Algorithm, Search, Ant Algorithms, No Fit Polygon,
Simulated Annealing



1 Introduction 

In the nesting problem it is necessary to place a number of shapes onto a larger 
shape. In doing so the shapes must not overlap and they must stay within the confines 
of the larger shape. The usual objective is to minimise the waste of the larger shape. 
Only two dimensions, height and width, of the shapes are considered and the larger 
piece is sometimes considered to be of infinite height so that only the width of the 
placements need be checked. This is a realistic assumption for the real world as the 
larger shapes are sometimes rolls of material which can be considered as being of 
infinite length for the purposes of the placement procedure. In this paper a number of 
assumptions are made. The height of the bin (the larger piece) is considered infinite, 
although it remains the aim of the evaluation function to minimise this height. Only 
convex polygons are considered. Only one bin is used (that is, there is no concept of 
filling a bin and having to start another). Only guillotine cuts are allowed (that is, a 
cut must be made from one edge to the other). In future research, once we have shown 
the method to be effective, we will relax these constraints. 

Ant Algorithms (described in section 4) are a relatively new search mechanism, 
having been introduced by Marco Dorigo in his PhD thesis [16] and in [9, 10]. 
Initially the algorithm was applied to the Travelling Salesman Problem (TSP) [19]. 

N. Foo (Ed.): AI'99, LNAI 1747, pp. 453-464, 1999. 
In [25], when ant algorithms are compared against an iterated local search (ILS)
algorithm, which is known to be among the best algorithms for the TSP, the ant
algorithm finds a better average solution quality than ILS. In recent years, ant
algorithms have also been applied to other problems. A recent paper [23] compares
the results of an ant algorithm using test problems from QAPLIB. The ant algorithm
was able to beat GRASP (greedy randomized adaptive search procedure) in all cases.
The algorithm was also applied to a real world case and found a solution that was
22.5% better than the current one (although it is noted that this figure must be read
with caution as other factors have been included in the real world planning). Ant
algorithms have also been applied to Vehicle Routing Problems (VRP) [4]. In [4] the
algorithm could not improve on published results but it is seen as a viable alternative
when tackling VRPs. Scheduling [11, 21], Graph Colouring [12], Partitioning
Problems [22] and Telecommunication Networks [15] have also been addressed by
ant algorithms. An overview of ant algorithms can be found in [17, 18].

We are not aware of any published work that applies ant algorithms to the nesting 
problem. The only work we have seen is a presentation at Optimization 98, University 
of Coimbra, Portugal (10-22 July 1998), which used an ant algorithm to place shapes. 
Unfortunately the conference did not publish proceedings. 

In [3] the no fit polygon (NFP) is introduced (described in section 2). The NFP is 
also used in [1] to calculate the minimum enclosing rectangle for two irregular 
shapes. Albano [2] used a heuristic search method to place irregular pieces onto stock 
sheets. The search is based on the A* algorithm but some restrictions are introduced 
due to the size of the search space. For example the heuristic function is not 
admissible. Their evaluation function, f(n), is given by g(n) + h(n) where g(n) is the 
cost of the solution so far, which is a measure of the waste that cannot be used in later 
placements. h(n) is a heuristic measure that estimates the amount of waste in the 
optimal solution should the current piece be included in the placement. If it were 
possible to find suitable values for h(n) then an optimal solution could be found. 
However, this is not possible so only “good” solutions can be found. The Albano and 
Sapuppo approach makes use of the no fit polygon when deciding where pieces 
should be placed. More recently [24] also used the no fit polygon. Pieces are placed 
onto the stock sheet one at a time. The location of the next piece is calculated using 
the no fit polygon. Once the best placement has been found the piece is added to the 
partial solution and the next piece is placed. Oliveira experimented with three 
evaluation functions when considering the placement of the next piece. 

This paper presents a method to produce solutions to the nesting problem. We 
employ an ant algorithm and compare our results with other evolutionary and meta- 
heuristic algorithms that have been applied to the same problem [5, 6]. 



2 No Fit Polygon 

The no fit polygon (NFP) determines all arrangements that two arbitrary polygons 
may take such that the polygons touch but do not overlap. If we can find the NFP for 
two given polygons then we know we cannot move the polygons closer together in 
order to obtain a tighter packing. In order to find the placements we proceed as 
follows (see fig. 1). One of the polygons (P1) remains stationary. The other polygon,
P2, orbits around P1, staying in contact with it but never intersecting it. Both polygons
retain their original orientation; that is, they never rotate. As P2 moves around P1, one
of its vertices (the reference point, shown as a filled circle) traces a line (this
becomes the NFP).




Fig. 1 shows the starting (and finishing) positions of P1 and P2. The NFP is shown
as a dashed line. It is slightly enlarged so that it is visible. In fact, some of the edges
will be identical to some of the edges of P1 and P2. Once we have calculated the NFP
for a given pair of polygons we can place the reference point of P2 anywhere on an
edge or vertex of the NFP in the knowledge that P2 will touch but not intersect P1. In
order to implement an NFP algorithm it is not necessary to emulate one polygon
orbiting another. [13] and [14] present an algorithm that works on the assumption that
(for convex polygons only) the number of edges of the NFP equals the combined
number of edges of P1 and P2. In addition, the edges of the NFP are simply copies of
the edges of P1 and P2, suitably ordered. To build the NFP it is a matter of taking the
edges of P1 and P2, sorting them and building the NFP using the ordered edges.
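
The edge-sorting construction can be sketched as below for convex polygons. This is a hedged illustration, not the algorithm of [13] or [14] verbatim: it assumes both polygons are given as counter-clockwise vertex lists, reverses the direction of the edges of P2 by reflecting it through the origin, sorts all edges by polar angle, and chains them from a start vertex. The function names and the choice of start vertex are assumptions.

```python
import math


def edges(poly):
    """Edge vectors of a polygon given as a list of (x, y) vertices."""
    n = len(poly)
    return [(poly[(i + 1) % n][0] - poly[i][0],
             poly[(i + 1) % n][1] - poly[i][1]) for i in range(n)]


def nfp_convex(p1, p2, ref):
    """NFP vertices for stationary p1 and orbiting p2 (both convex, CCW).

    ref is the reference point of p2, in the same coordinates as p2.
    The result lists the positions of ref as p2 slides around p1.
    """
    neg_p2 = [(-x, -y) for x, y in p2]   # point reflection keeps CCW order
    all_edges = edges(p1) + edges(neg_p2)
    all_edges.sort(key=lambda e: math.atan2(e[1], e[0]) % (2 * math.pi))
    # Start where the bottom-most (then left-most) vertices coincide.
    s1 = min(p1, key=lambda v: (v[1], v[0]))
    s2 = min(neg_p2, key=lambda v: (v[1], v[0]))
    x, y = s1[0] + s2[0] + ref[0], s1[1] + s2[1] + ref[1]
    nfp = []
    for dx, dy in all_edges:
        nfp.append((x, y))
        x, y = x + dx, y + dy
    return nfp


# A 2x2 square (stationary) and a 1x1 square whose reference point is its
# bottom-left vertex.
p1 = [(0, 0), (2, 0), (2, 2), (0, 2)]
p2 = [(0, 0), (1, 0), (1, 1), (0, 1)]
nfp = nfp_convex(p1, p2, ref=(0, 0))
```

For these two squares the NFP has eight edges, four copied from each polygon, and bounds the region from (-1, -1) to (2, 2).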



3 Evaluation 



3.1 Placing the Polygons 

In order to fill the bin we proceed as follows. The first polygon to be placed is 
chosen and becomes the stationary polygon (P1 in the example above). The next
polygon (P2 from the above example) becomes the orbiting polygon. Using these two 
polygons the NFP is constructed. The reference point of P2 is placed on each vertex of 
the NFP and for each position the convex hull for the two polygons is calculated. 
Once all placements have been considered the convex hull that has the minimum area 
is returned as the best packing of the two polygons. This larger polygon now becomes 
the stationary polygon and the next polygon is used as the orbiting polygon. This 
process is repeated until all polygons have been processed. As each large polygon (i.e. 
output from the convex hull operation) is created, its width is checked. If this exceeds 
the width of the bin, then a new row within the bin is started. In this case the polygon
which forced the width of the bin to be exceeded becomes the stationary polygon. 
That is, the large polygon built thus far forms one row and the next row is constructed 
using a single polygon as a starting point. 
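
The placement step for one pair of polygons can be sketched as follows. The sketch assumes the NFP vertices are already available (here they are hard-coded for a 2x2 and a 1x1 square), uses Andrew's monotone chain for the convex hull, and returns every minimum-area placement so that one can be picked at random; the function names and the tie tolerance are hypothetical. Row handling against the bin width is omitted.

```python
import random


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula."""
    total = 0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def best_placements(stationary, orbiting, ref, nfp_vertices, tol=1e-9):
    """Try the reference point on every NFP vertex; keep the smallest hulls."""
    best_area, best = None, []
    for vx, vy in nfp_vertices:
        dx, dy = vx - ref[0], vy - ref[1]
        moved = [(x + dx, y + dy) for x, y in orbiting]
        hull = convex_hull(stationary + moved)
        area = polygon_area(hull)
        if best_area is None or area < best_area - tol:
            best_area, best = area, [((vx, vy), hull)]
        elif abs(area - best_area) <= tol:
            best.append(((vx, vy), hull))
    return best_area, best


p1 = [(0, 0), (2, 0), (2, 2), (0, 2)]
p2 = [(0, 0), (1, 0), (1, 1), (0, 1)]
nfp = [(-1, -1), (1, -1), (2, -1), (2, 1), (2, 2), (0, 2), (-1, 2), (-1, 0)]
area, options = best_placements(p1, p2, (0, 0), nfp)
position, hull = random.choice(options)  # ties are broken at random
```

The chosen hull would then become the stationary polygon for the next piece.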

There are a number of points that need to be considered when using the above 
evaluation method. For reasons of space, a complete discussion of these points cannot 
be provided but they are detailed in [7], with the main points summarised below.

A number of computational geometry algorithms have been implemented in order 
to effectively manipulate polygons. Executing computational geometry algorithms is 
computationally expensive. This makes the evaluation function the slowest part of the 
algorithm. In view of this a cache has been implemented which retains n (user 
defined) previous evaluations. If a solution (partial or complete) is in the cache then 
the results are retrieved from the cache rather than executing the evaluation function. 
To further improve the algorithm the concept of a polygon type has been introduced. 
This allows polygons which are identical to be classified together so that the cache 
can work at a higher level of abstraction when deciding if a partial or complete 
solution is in the cache. Once the NFP has been calculated there may be a number of 
optimal placements for the two polygons being evaluated (that is, the convex hulls 
have equal area). In this case, the convex hull to use is selected at random. However, 
all the optimal solutions are stored in the cache so that if the solution (partial or 
complete) is seen again, one of the other convex hulls could be selected. 

If n polygons have been evaluated, the solutions in the cache for the polygon
will be influenced by earlier evaluations. However, even though the convex hull with
the lowest area has been chosen earlier in the evaluation, this may not be the best
choice once other polygons are added to the solution. Therefore, it may be beneficial 
to reevaluate a complete solution even if it is stored in the cache. In order to 
accommodate this a reevaluation parameter was introduced which, with some 
probability, forces a solution to be re-evaluated even if it is in the cache. 
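
The caching scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the class name `EvaluationCache`, the least-recently-used eviction policy and the use of a tuple of polygon types as the key are all assumptions made for the example. The storage of every equally good convex hull is not modelled.

```python
import random
from collections import OrderedDict


class EvaluationCache:
    """Bounded cache of evaluations keyed by a sequence of polygon types.

    Hypothetical sketch: the eviction policy (least recently used) is an
    assumption; the text only says that n previous evaluations are kept.
    """

    def __init__(self, capacity, reevaluation_probability=0.0, rng=None):
        self.capacity = capacity
        self.reevaluation_probability = reevaluation_probability
        self.rng = rng or random.Random()
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def evaluate(self, polygon_types, evaluation_function):
        # Identical polygons share a type, so permutations that differ only
        # by swapping identical polygons map to the same key.
        key = tuple(polygon_types)
        # With some probability a cached solution is evaluated again anyway.
        forced = self.rng.random() < self.reevaluation_probability
        if key in self._store and not forced:
            self.hits += 1
            self._store.move_to_end(key)
            return self._store[key]
        self.misses += 1
        value = evaluation_function(key)
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # drop the oldest entry
        return value
```

Setting `reevaluation_probability` to 0 gives a plain cache, while a value of 1 disables caching altogether; intermediate values trade speed for the chance of finding a better hull choice.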



3.2. Evaluating the Placement 

Our cost function is based on that used in [20]. It can be stated as follows:

$$\left(\sum_{i=1}^{n} \left(1 - \frac{UsedRowArea}{TotalRowArea}\right)^{2} \cdot k\right) / n \qquad (1)$$

where UsedRowArea is the total area of the polygons placed in that row,
TotalRowArea is the total area of the bin occupied by that row,
k is a factor simply to scale the result (we used 100 but 1 could be used), and
n is the number of rows in the bin.

In words, we are trying to minimise the area used by each row. 

This evaluation function is preferable to the more obvious method of simply 
measuring the bin height as, using the bin height, many solutions will map to the same 
evaluation function. This makes it much more difficult to effectively explore the 
search space. In previous work, we have compared these two different types of
evaluation and we find that the above function (1) produces better quality solutions,
which is the same experience as reported in [20].
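
A minimal sketch of the row-based evaluation is given below. It assumes that the waste fraction of each row, 1 - UsedRowArea/TotalRowArea, is squared before scaling by k; the printed formula is hard to read at that point, so the exponent placement is an assumption, and the function name `row_evaluation` is hypothetical.

```python
def row_evaluation(rows, k=100.0):
    """Average scaled, squared waste fraction over all rows (lower is better).

    rows: list of (used_row_area, total_row_area) pairs, one pair per row.
    Assumption: the square applies to the waste fraction of each row.
    """
    if not rows:
        raise ValueError("at least one row is required")
    total = 0.0
    for used_row_area, total_row_area in rows:
        waste_fraction = 1.0 - used_row_area / total_row_area
        total += waste_fraction ** 2 * k
    return total / len(rows)


# Two hypothetical packings with the same number of rows (and so possibly the
# same bin height) still receive different evaluations.
uneven = row_evaluation([(90.0, 100.0), (50.0, 100.0)])
even = row_evaluation([(70.0, 100.0), (70.0, 100.0)])
```

This illustrates why such a function gives the search more to work with than bin height alone: packings that tie on height are still ranked by how well each row is filled.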



4 Ant Algorithms 

4.1 Ant Algorithms and the TSP 

Ant algorithms are based on the real world phenomenon that ants, despite being
almost blind, are able to find their way to a food source and back to their nest, using
the shortest route. In [19] this phenomenon is discussed by considering what happens
when an ant comes across an obstacle and it has to decide the best route to take 
around the obstacle. Initially, there is equal probability as to which way the ant will 
turn in order to negotiate the obstacle. If we assume that one route around the obstacle 
is shorter than the alternative route then the ants taking the shorter route will arrive at 
a point on the other side of the obstacle before the ants which take the longer route. If 
we now consider other ants coming in the opposite direction, when they come across 
the same obstacle they are also faced with the same decision as to which way to turn. 
However, as ants walk they deposit a pheromone trail. The ants that have already 
taken the shorter route will have laid a trail on this route so ants arriving at the 
obstacle from the other direction are more likely to follow that route as it has a 
deposit of pheromone. Over a period of time, the shortest route will have high levels 
of pheromone so that all ants are more likely to follow this route. This form of 
behaviour is known as autocatalytic behaviour. There is positive feedback which
reinforces that behaviour so that the more ants that follow a particular route, the more 
desirable it becomes. 

To convert this idea to a search mechanism for the Travelling Salesman Problem 
(TSP) there are a number of factors to consider. Below is a summary of [19]. 

At the start of the algorithm one ant is placed in each city. Time, t, is discrete; t(0)
marks the start of the algorithm. At t+1 every ant will have moved to a new city and
the parameters controlling the algorithm will have been updated. Assuming that the
TSP is being represented as a fully connected graph, each edge has an intensity of
trail on it. This represents the pheromone trail laid by the ants. Let $\tau_{ij}(t)$ represent the
intensity of trail on edge (i, j) at time t. When an ant decides which city to move to next,
it does so with a probability that is based on the distance to that city and the amount
of trail intensity on the connecting edge. The distance to the next city, known as the
visibility, $\eta_{ij}$, is defined as $1/d_{ij}$, where $d_{ij}$ is the distance between cities i and j.

At each time unit evaporation takes place. This is to stop the intensity trails
building up unbounded. The amount of evaporation, $\rho$, is a value between 0 and 1.

In order to stop ants visiting the same city in the same tour a data structure, Tabu,
is maintained. This prevents ants visiting cities they have previously visited. $Tabu_k$ is
defined as the list for the kth ant which holds the cities that have already been visited.

After each ant tour the trail intensity on each edge is updated using the following
formula

$$\tau_{ij}(t+n) = \rho \cdot \tau_{ij}(t) + \Delta\tau_{ij} \qquad (2)$$

where $\Delta\tau_{ij}$ is the sum over all ants k of

$$\Delta\tau_{ij}^{k} = \begin{cases} Q / L_k & \text{if the kth ant uses edge } (i, j) \text{ in its tour (between time } t \text{ and } t+n) \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

This represents the trail substance laid on edge (i, j) by the kth ant between time t
and t+n. Q is a constant and $L_k$ is the tour length of the kth ant. Finally, we define the
transition probability that the kth ant will move from city i to city j:

$$p_{ij}^{k}(t) = \begin{cases} \dfrac{[\tau_{ij}(t)]^{\alpha} \cdot [\eta_{ij}]^{\beta}}{\sum_{l \in allowed_k} [\tau_{il}(t)]^{\alpha} \cdot [\eta_{il}]^{\beta}} & \text{if } j \in allowed_k \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\alpha$ and $\beta$ are control parameters that control the relative importance of trail
versus visibility.
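
The transition rule and the trail update summarised above can be sketched as follows. This is an illustrative reading of the ant system of [19], not code from that work: the function names, the nested-list representation of trail and visibility, and the assumption of a symmetric TSP with strictly positive initial trail values are all choices made for the example.

```python
import random


def transition_probabilities(current, allowed, trail, visibility, alpha, beta):
    """Probability of moving from `current` to each city still allowed.

    Each weight is trail^alpha * visibility^beta, normalised over the allowed
    cities. Assumes at least one weight is positive.
    """
    weights = {
        j: (trail[current][j] ** alpha) * (visibility[current][j] ** beta)
        for j in allowed
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_next(current, allowed, trail, visibility, alpha, beta, rng):
    """Sample the next city according to the transition probabilities."""
    probs = transition_probabilities(current, allowed, trail, visibility, alpha, beta)
    cities = list(probs)
    return rng.choices(cities, weights=[probs[c] for c in cities], k=1)[0]


def update_trails(trail, tours, tour_lengths, rho, q):
    """Scale the old trail by rho, then add q / L_k on every edge of ant k's tour."""
    n = len(trail)
    delta = [[0.0] * n for _ in range(n)]
    for tour, length in zip(tours, tour_lengths):
        # Pair each city with its successor, closing the tour at the end.
        for a, b in zip(tour, tour[1:] + tour[:1]):
            delta[a][b] += q / length
            delta[b][a] += q / length  # symmetric TSP assumed
    for i in range(n):
        for j in range(n):
            trail[i][j] = rho * trail[i][j] + delta[i][j]
```

In a full run, each ant would call `choose_next` repeatedly with its own shrinking `allowed` list (the complement of its Tabu list), and `update_trails` would be called once all ants have completed a tour.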



4.2 Ant Algorithms and the Nesting Problem 

Using the TSP ant system as a model, an ant system has been developed for the 
nesting problem using the no fit polygon and the evaluation method described in
section 3. Each polygon can be viewed as a city in the TSP and these are fully 
connected so that there is an edge between each polygon and every other. An ant is 
placed at each city (polygon) and using the trail and visibility values (formula 4) the 
ant decides which polygon should be visited (placed) next. Once an ant has placed all 
the polygons the edge trail values are updated. The edge trail values are calculated 
using the value returned from evaluating the nesting. This is equivalent to using the 
tour length in the TSP ($L_k$ in formula 3).

Using the method described above we can use very similar formulae to those 
described in [19] (and shown above (2, 3, 4)), with some minor amendments. 
Visibility is now defined as how the polygon, just placed, fits with the piece about to 
be placed. For example, two rectangles of the same dimensions would fit together 
with no waste so the visibility would be high. Two irregular shapes, when placed 
together, may result in high wastage. This would result in a low visibility value. In 
order to calculate the visibility, the combined area of the two pieces is divided by the 
best placement of the two shapes (using the NFP). That is 



$$\eta_{ij} = TotalArea_{ij} / BestPlacement_{ij} \qquad (5)$$

where i is the polygon just placed and j is the polygon about to be placed. This
returns a value between 0 and 1. In order to improve the speed of the algorithm all the
visibility values are calculated at the start of the algorithm and held in a cache. The
transition probability is defined as the probability of a polygon being placed next
taking into account the polygons placed so far and the visibility of the next polygon.
The same formula as 4, above, can be used to calculate the transition probability.
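
Precomputing the visibility values might look like the sketch below. The helper `best_hull_area(i, j)` is hypothetical: it stands in for the no fit polygon step that returns the area of the smallest convex hull enclosing polygons i and j, which is not reproduced here.

```python
def build_visibility_cache(areas, best_hull_area):
    """Visibility for every ordered pair (i, j) of distinct polygons.

    areas: list of polygon areas.
    best_hull_area: hypothetical callable returning the area of the best
        (smallest) convex hull of polygons i and j placed together.
    """
    cache = {}
    for i in range(len(areas)):
        for j in range(len(areas)):
            if i != j:
                # Combined area of the two pieces over the best placement area.
                cache[(i, j)] = (areas[i] + areas[j]) / best_hull_area(i, j)
    return cache
```

Two identical rectangles that pack with no waste give a visibility of 1, whereas an awkward pair whose hull is much larger than their combined area gives a value closer to 0.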



5 Testing and Results 

5.1 Test Data 

Two problems were used in our testing. The packing shown in fig. 2 is from [8].
This data was chosen because it consists of convex polygons and the
optimum is known. The only change made is to multiply the measurements in the 
original paper by a factor of two. This assists us when displaying results. 




Fig. 2. Test Data 1 Fig. 3. Test Data 2 

Our algorithms will not be able to find the optimum. The first two rows (A, B, C 
and D, E, F) can be constructed without problems. However, to find the optimum for 
the third row, the polygons would need to be presented in the order of G, H, I, J, K, L
and M. The optimal solution could be built until the last polygon (M) came to be 
placed. At this time, due to the convex nature of the large polygon that had been built, 
the final polygon will not be placed in the position shown. In fact, the final polygon 
would be placed on a new row. Under these circumstances the total bin height would 
be 188 (as opposed to the optimal height of 140). Therefore, the permutation
ABCDEFGHIJKLM produces a bin height of 188. We accept that we will never
achieve a value of 140 but would like to get as close as possible to this figure. 

The second problem (fig. 3) is taken from the real world. A company has to cut 
polycarbonate shapes from larger stock sheets. The solution shown (fig. 3) is one 
stock sheet on a given day. Like the first problem, this is constrained by the convex 
properties of the algorithm. For example, the bottom four shapes cannot be packed in 
the way they have been as once three of the shapes have been packed, the fourth 
shape cannot be placed in the position shown. The same is true for the four shapes 
above (excluding the rectangle). The height of the stock sheet is 23940 units. Its width 
is 6240 units. Due to the convex properties of the algorithm it is not possible to 
achieve the solution shown but our main aim is to see if ant algorithms can match or 
beat the solutions we have already obtained. 



5.2 Results 

In this section we compare the ant algorithm with the results we have achieved 
using a genetic algorithm, tabu search and simulated annealing [5, 6]. All results are 
averaged over ten runs. The number of evaluations performed is equivalent in all
cases. This leads to runtimes that are approximately equivalent (about 300 seconds on
a Cyrix 166 processor) so that the results are compared fairly. The results show the
evaluation value as well as the bin height. Although the bin height is not part of the
evaluation, it is an important measure with regard to the quality of the solution and is 
therefore worth recording. 

Initially we attempted to find suitable values for the parameters that control the ant 
algorithm. This aspect of advanced search remains an art, rather than a science and, in 
order to find suitable values we carried out several hundred runs simply setting the 
parameter values at random. We used these results, along with the best parameters 
found by Dorigo [19], to conduct more selective testing. Dorigo reported that the 
value of Q (the constant in formula 3) had little effect on the algorithm. We 
experimented with various values, {1, 10, 100, 1000}, and reached a similar 
conclusion. Therefore in the remainder of our tests Q = 100. In order to find a good 
value for the evaporation parameter, $\rho \in \{0.1, 0.5, 0.9\}$ was tested, using a trail
importance, $\alpha = 1$, and a visibility importance, $\beta \in \{0, 1, 2, ..., 30\}$. These tests were
carried out on test data 1. The results from these tests are shown in fig. 4. 

All tests in fig. 4 show the highest evaluation when $\beta = 0$. This is expected as when
$\beta = 0$ the search is effectively transformed into randomised greedy search with
multiple starting points. All three runs also show, in the early stages, a downward
trend as the visibility parameter increases. With $\rho = 0.1$ the evaluation values are
generally higher than when $\rho$ is 0.5 or 0.9. $\rho = 0.5$ performs better than $\rho = 0.9$,
at least in the early stages of the algorithm (until $\beta = 19$). In the latter stages, when the
visibility is high, the graph has either flattened or is showing an upward trend for all
values of $\rho$. Again, this would be expected as having too high a visibility starts
returning the algorithm to a greedy search. This is due to the effect of the intensity
trail becoming diminished. $\rho = 0.5$ as a good value agrees with the results in [19], in
which it was reported that this was the best value found for $\rho$. In [19] the best value
for $\alpha$ was found to be 1 (which is the value used above). In order to see if our
algorithm agrees with this, a test was carried out that set $\rho = 0.5$, $\alpha = 5$ and
$\beta \in \{0, 1, ..., 20\}$. Fig. 5 shows that this set of parameters produces worse results than
when $\alpha = 1$. None of the tests produces an evaluation below 600, which was
consistently achieved when $\alpha = 1$ (fig. 4).

Fig. 4. Test Data 1, $\alpha = 1$, $\rho = \{0.1, 0.5, 0.9\}$, $\beta = \{0, 1, ..., 30\}$

Fig. 5. Test Data 1, $\alpha = 5$, $\rho = 0.5$, $\beta = \{0, 1, ..., 20\}$ (axis label: Visibility)

Fig. 6. Test Data 2, $\alpha = 1$, $\rho = \{0.1, 0.5\}$, $\beta = \{0, 1, ..., 30\}$

Having established a good parameter set we used test data 2 to confirm these
values. Fig. 6 shows two runs that compare the effect of $\rho$ (evaporation) on test data 2.
In fact the two runs mimic each other closely but it is interesting to note that the
lowest evaluation for $\rho = 0.5$ is when visibility is around 20. This matches the result
from the first set of test data. This test appears to confirm that $\rho = 0.5$ is a good choice
and we used this in the remaining tests.




Fig. 7. Test Data 2, $\alpha = \{1, 5\}$, $\rho = 0.5$, $\beta = \{0, 1, ..., 30\}$ (axis label: Visibility)



Fig. 7 shows the effect of trail importance, $\alpha \in \{1, 5\}$, using test data 2. It shows
that a higher value of $\alpha$ leads to inferior solutions. Again, this confirms the results
from the first set of test data. In summary, the best results were achieved on both sets
of test data when $\alpha = 1$, $\rho = 0.5$, $\beta = 20$. The best results we have achieved using other
search algorithms [5, 6] and the modified Falkenauer function are shown in table 1.
We also show the best results from the ant algorithm using the parameters just
described.

Table 1. Results using the same test data with different search algorithms

                       Test Data 1                Test Data 2
Search Method          Evaluation   Bin Height    Evaluation   Bin Height
Genetic Algorithm          516.41       179.00       1377.99     30478.50
Tabu Search                323.77       165.80       1006.75     27566.40
Simulated Annealing        397.01       170.80       1661.77     33052.20
Ant Algorithm              412.97       173.20       1316.85     29316.60

Tabu search finds the best quality solutions of the algorithms implemented but the 
ant algorithm compares favourably with simulated annealing, performing significantly 
better with test data 2. The ant algorithm also outperforms the other population-based
search method (genetic algorithm). 



6 Summary

This is the first time that ant algorithms have been applied to the nesting problem
and the results are encouraging. Ant algorithms outperform genetic algorithms and
provide a viable alternative to simulated annealing, although more work is required
for them to compete with tabu search. We will continue using ant algorithms in our future
research, in which we will tackle more complex problems as well as relaxing some of the
constraints we outlined above. The parameters to the ant algorithm are critical and we
plan to carry out more work in this area. In addition we plan to look at hybridisation to
combine ant algorithms with other search techniques in order to produce even better
solutions.

Abstract. In this paper, we propose an extended Vector Space Model
(VSM) called the Fisheye Matching method, which generates features related
to the users’ viewpoints based on an electronic dictionary. We have also
developed a GUI system employing the Fisheye Matching method, which
can help users order a vast collection of documents in several ways
employing the users’ viewpoint information extracted by the Fisheye
Matching method.



When a user gets some ideas from papers or articles, he may organize his
thoughts by relating incoming information to knowledge which already exists
in his mind. This process gets harder in proportion to the volume of
information which he considers, and it is useful to illustrate a concept structure
on paper or on a display, which reduces his confusion. We assert
that ordering documents while reading is an effective way of dealing with a
vast collection of documents, and that systems which assist such processes should
be able to find relations among documents based on users’ viewpoints/interests.
From this point of view, we have proposed an extended VSM called the Fisheye
Matching method [1] to perform vector matching on a vector space sensitive
to users’ viewpoints. Each feature in the Fisheye Matching method is generated
as a set of words which belong to the same concept (meaning) in a dictionary^1.
By choosing concepts (features) appropriate for the users’ viewpoints from a dictionary,
a vector space is constructed so that the matching results can be expected
to be preferable for them. We have also proposed an algorithm to find
concepts appropriate for the users’ viewpoints from training sets of documents.
Furthermore, each concept in a dictionary has heading information, which
can be presented explicitly to the users as a kind of viewpoint information.
Some experiments on document retrieval tasks have been performed, and
Table 1 shows some of the concepts found and used as features during the experiments.
From both this table and its precision property, it was confirmed that the

^1 We have used the EDR electronic dictionary developed by Japan Electronic Dictionary
Research Institute, Ltd.: http://www.iijnet.or.jp/edr/

N. Foo (Ed.): AI’99, LNAI 1747, pp. 465-466, 1999.

Table 1. Examples of concepts extracted as features in the case of retrieving
documents about medical topics

ID       Heading Term               Group Words (stemmed)
3f98b3   value health               diseas sickne health ...
444506   component of living body   protei immuno choles dna
30f6da   internal organs            eye heart lung knee ...
3f969e   disease                    syndro aids cancer cold ...
44479c   medical supplies           drug medici laxati acid ...
30f6f7   medical instruments        bandag cathet glasse ...




Fig. 1. Fview: GUI system for document ordering support 



Fisheye Matching method can not only retrieve documents in which the users 
take interest, but also supply them with useful information on their viewpoints. 
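
The idea of using dictionary concepts as vector features can be illustrated with the sketch below. It is only an illustration under stated assumptions: raw counts as feature weights and cosine similarity as the matching function are choices made for the example, the function names are hypothetical, and the two concept groups are copied from Table 1 rather than read from the EDR dictionary.

```python
import math
from collections import Counter


def concept_vector(tokens, concepts):
    """Count, for each concept ID, how many tokens fall in its word group."""
    counts = Counter()
    for concept_id, words in concepts.items():
        counts[concept_id] = sum(1 for token in tokens if token in words)
    return counts


def cosine(u, v):
    """Cosine similarity of two concept vectors (0.0 if either is all zero)."""
    dot = sum(u[c] * v[c] for c in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two concept groups taken from Table 1 (stemmed words).
concepts = {
    "3f969e": {"syndro", "aids", "cancer", "cold"},    # disease
    "44479c": {"drug", "medici", "laxati", "acid"},    # medical supplies
}
doc_a = concept_vector(["cancer", "drug", "drug"], concepts)
doc_b = concept_vector(["cold", "medici"], concepts)
similarity = cosine(doc_a, doc_b)
```

Although the two toy documents share no word, they are judged similar because their words fall under the same concepts; choosing a different set of concepts would change which documents appear related.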



We have also developed a GUI system which assists users in ordering documents
with the Fisheye Matching method (Fig. 1). By using the Fisheye Matching
method, the system can extract the users’ viewpoint information from the
diagrams produced by them. This information can be used by the system
not only to retrieve documents suited to the users’ current interests, but also to
indicate similarities among documents of which they may not be aware. Furthermore,
the users’ viewpoint information is also presented to them as a list
of heading information of extracted concepts. Several students actually used
this system, and it has been confirmed that effective assistance for users can be
achieved.

Abstract. We consider the knowledge base update in which a knowledge
base is realised by a normal logic program and can be updated with
an inserting or deleting update rule. In this paper we propose an SLDNF
resolution based procedure to implement this kind of rule based update.
We prove a correctness theorem for our approach under the stable model 
semantics and show that a minimal change criterion is also satisfied in 
the underlying update formalization. 



1 Introduction 

The knowledge base is realised as a normal logic program. Basically, rule based 
update addresses the following problem: Given an initial knowledge base and an 
update rule, how to update the initial knowledge base such that whenever the 
body of the update rule is achieved in the initial knowledge base, the head and 
the body of the update rule are achieved in the resulting knowledge base. 

2 Definitions and Concepts 

A rule is of the form:

$$A_0 \leftarrow A_1, \cdots, A_m, not\, B_{m+1}, \cdots, not\, B_n, \qquad (1)$$

where $A_0, \cdots, A_m, B_{m+1}, \cdots, B_n$ are atoms of language $\mathcal{L}$. We specify that a
knowledge base is a normal logic program. An update rule is a rule with one
of the following two forms:

$$Q \leftarrow P_1, \cdots, P_m, not\, P_{m+1}, \cdots, not\, P_n, \qquad (2)$$

$$not\, Q \leftarrow P_1, \cdots, P_m, not\, P_{m+1}, \cdots, not\, P_n. \qquad (3)$$

Rule (2) is called the inserting update rule, while rule (3) is called the deleting
update rule. In our formalism, an update is a transformation on knowledge bases.

N. Foo (Ed.): AI’99, LNAI 1747, pp. 467-468, 1999.

3 Formal Descriptions

It is well known that an SLDNF tree may include an infinite branch.
In this case, no result can be proved from the SLDNF resolution proof. To
avoid this problem, in our update context, we assume that during the update,
each SLDNF tree is finite. We present the main procedures, named Checking and
Insert/DeleteUpdate.



Algorithm 1. Checking($\mathcal{P}$, r)

Function: Check the body of an update rule r in logic program $\mathcal{P}$.

Input: A logic program (knowledge base) $\mathcal{P}$ and an update rule r of form (2)
or (3), where the body of r consists of $P_1, \cdots, P_m, not\, P_{m+1}, \cdots, not\, P_n$.
Output: A Boolean value True or False.

The function of algorithm Checking($\mathcal{P}$, r) is to check if the body of the update
rule r is achieved by $\mathcal{P}$.



Algorithm 2. Insert/DeleteUpdate($\mathcal{P}$, r)^1

Function: Update $\mathcal{P}$ with an inserting/deleting update rule r.

Input: A logic program $\mathcal{P}$ and an update rule r of form (2) or (3).
Output: An updated logic program $\mathcal{P}'$.



Example 1. Consider a knowledge base $\mathcal{P}$: $\{A \leftarrow B,\ A \leftarrow not\, B,\ B \leftarrow C\}$
and an inserting update rule $r_1$: $C \leftarrow A$. Suppose we update $\mathcal{P}$ with $r_1$. Firstly,
by using algorithm Checking($\mathcal{P}$, $r_1$), it is clear that A is derivable from $\mathcal{P}$. So
Checking($\mathcal{P}$, $r_1$) = True. Then by algorithm InsertUpdate($\mathcal{P}$, $r_1$), we simply
obtain a unique resulting knowledge base $\mathcal{P}' = \mathcal{P} \cup \{C \leftarrow\}$. Note that the
body A in $r_1$ can still be proved from $\mathcal{P}'$.

Now we consider updating $\mathcal{P}'$ with a deleting update rule $r_2$: $not\, A \leftarrow$.
Consider the SLDNF tree for $\mathcal{P}' \cup \{\leftarrow A\}$. To make the successful branch fail,
there are several options according to our algorithm. Firstly, if we choose node
“$\leftarrow C$” as $n_i$ and “$\Box$” as $n_{i+1}$, then to make this branch fail, we may remove
fact $C \leftarrow$ from $\mathcal{P}'$ according to case (1) in Algorithm 2. However, this does
not achieve our purpose that $\mathcal{P}'' \nvdash A$, where $\mathcal{P}'' = \mathcal{P}' - \{C \leftarrow\}$. It is clear
that even if rule $A \leftarrow B$ cannot be used to derive A in $\mathcal{P}''$, A can still be proved
from $\mathcal{P}''$ through rule $A \leftarrow not\, B$. According to Algorithm 2, we then have to consider
other options. The only way available to us is to choose node “$\leftarrow A$” as
node $n_i$ and node “$\leftarrow B$” as node $n_{i+1}$. Then from case (3) in the algorithm,
we need to remove rule $A \leftarrow B$ from $\mathcal{P}'$. Therefore, after updating $\mathcal{P}'$ with the
deleting update rule $r_2$, we obtain the resulting knowledge base $\mathcal{P}''$:

$$\{A \leftarrow not\, B,\ B \leftarrow C,\ C \leftarrow\}.$$

This example also shows that achieving the update is not always a matter of adding or removing
simple facts from the initial knowledge base. $\Box$
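
The claims in Example 1 can be checked mechanically on this small propositional program. The sketch below does not implement the SLDNF-based procedures; instead it swaps in a brute-force enumeration of stable models (via the Gelfond-Lifschitz reduct), which is feasible only because the example has three atoms. The rule encoding as `(head, positive_body, negative_body)` triples and the function names are assumptions made for the illustration.

```python
from itertools import combinations


def least_model(definite_rules):
    """Least model of a negation-free program given as (head, body) pairs."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, positive_body in definite_rules:
            if head not in model and set(positive_body) <= model:
                model.add(head)
                changed = True
    return model


def stable_models(program, atoms):
    """All stable models of a propositional normal program (brute force)."""
    atoms = sorted(atoms)
    models = []
    for size in range(len(atoms) + 1):
        for candidate in combinations(atoms, size):
            candidate = set(candidate)
            # Reduct: drop rules whose negative body meets the candidate,
            # then drop the negative literals from the remaining rules.
            reduct = [(h, pos) for h, pos, neg in program
                      if not (set(neg) & candidate)]
            if least_model(reduct) == candidate:
                models.append(candidate)
    return models


ATOMS = {"A", "B", "C"}
P = [("A", ["B"], []), ("A", [], ["B"]), ("B", ["C"], [])]
P1 = P + [("C", [], [])]                      # after inserting C <-
P1_minus_fact = [r for r in P1 if r != ("C", [], [])]
P2 = [r for r in P1 if r != ("A", ["B"], [])]  # after removing A <- B
```

Under these definitions, A belongs to the only stable model of the original program and of the program with the fact for C removed again, but not to the only stable model of the final program, which matches the discussion in the example.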

^1 A detailed explanation of this algorithm is given in our full paper.

Abstract. We present a polynomial time algorithm to learn a rich
class of logic programs (called one-recursive programs) from positive examples
alone. This class of programs uses the divide-and-conquer methodology
and contains a wide range of programs such as append, reverse,
merge, split, delete, insertion-sort, preorder and inorder traversal
of binary trees, polynomial recognition, derivatives, and sum of
a list of numbers, and allows local variables.



Main Results 

Starting from the influential work of Gold [2], a lot of effort has gone into the 
development of a rich theory about inductive inference and the classes of con- 
cepts which can be learned from both positive (examples) and negative data 
(counterexamples), and the classes of concepts which can be learned from posi- 
tive data alone. The study of inferability from positive data alone is important 
because negative examples are hard to obtain in practice. See [3] for further 
discussion and references. 

The existing literature is mainly concerned with either nonrecursive programs
or recursive programs without local variables (variables occurring in the body 
of a clause but not occurring in the head of that clause), usually with a further 
restriction that programs contain a unit clause and at most one recursive clause 
with just one atom in the body. In other words, standard programs for various 
sorting and tree traversal algorithms, which use local variables, are beyond the 
scope of these results. 

As established by many authors in the literature, learning recursive logic pro- 
grams, even with the above restrictions, is a very difficult problem. We approach 
this problem from a programming methodology angle and propose an algorithm 
to learn a class of logic programs, that use divide-and-conquer methodology. Our 
endeavor is to develop an inference algorithm that learns a very natural class of 
programs so that it will be quite useful in practice. We measure the naturality 
of a class of programs in terms of the range of programs it covers from a standard
Prolog book such as [5]. To summarize, major contributions of the paper
include: (1) a polynomial time learning algorithm that does not need negative examples
and does not ask any queries, (2) the ability to handle programs with local
variables, enlarging the class of learnable programs (and thereby enlarging the
scope of applications) and (3) a novel approach to the learning problem from
the programming methodology point of view.



One-Recursive Programs 

The divide-and-conquer approach and recursive subterms are the two central 
themes of our class of programs. The predicates defined by these programs are 
recursive on the leftmost argument. The leftmost argument of each recursive call 
invoked by a caller is a recursive subterm of the arguments of the caller. 

Notation: In atom p(s; t), s is the sequence of input terms and t is the sequence
of output terms. The size of t, denoted by |t|, is the total number of variables, 
constants and functions occurring in it. In the following, builtins is a (possibly 
empty) sequence of atoms with built-in predicates having no output positions. 

Definition 1 A linearly-moded well-typed logic program [3] without mutual
recursion is one-recursive if each clause in it is of the form

$$p(s_0; t_0) \leftarrow builtins, p(s_1; t_1), \cdots, p(s_k; t_k) \quad \text{OR}$$

$$p(s_0; t_0) \leftarrow builtins, p(s_1; t_1), \cdots, p(s_k; t_k), q(s; t)$$

such that (a) $s_i$ is the same as $s_0$ except that the leftmost term in $s_i$ is a recursive
subterm of the leftmost term in $s_0$ for each $1 \le i \le k$, (b) the terms in $t_i$, $i \ge 1$,
are distinct variables not occurring in $s_0$, the terms in $s_0$ are variables or one
of the first two generic-expressions^1 of the asserted types and $|s_0| > |t_0|$.

Due to space limitations, the learning algorithm is not presented. See [4] for it. 

Comparison with Related Works 

The works of Arimura et al. [1] and Krishna Rao [3] are closely related to ours.

1. Our results are generalizations of the results in Arimura et al. [1] and the
class of context-free transformations (no local variables) considered in [1] is
a proper subclass of our class of one-recursive programs.

2. In contrast to the results of Krishna Rao [3], we investigate polynomial
time learnability from positive data alone (no negative examples).

^1 The terms void and tree(T1, X, T2) are the first two generic-expressions of the type
Binary-tree. Similarly, [ ] and [H|L] are the first two generic-expressions of the type
List. The first two generic-expressions of any given recursive type are unique up to
variable renaming. The recursive subterms of tree(T1, X, T2) are T1 and T2. Similarly,
L is the recursive subterm of [H|L].

SQL-TUTOR [1,2] is an ITS for teaching the database language SQL to upper-level
undergraduate students taking database courses. Students using SQL-TUTOR work 
through a series of problems where the solution is an SQL statement. Although SQL- 
TUTOR does not solve problems, it does have an ideal solution (IS) for each one. A 
correct student solution (SS) to a problem may be the same as the IS although there 
can be more than one correct solution. Figure 1 is an example of a problem in SQL- 
TUTOR, its IS, and an incorrect SS. 



Problem                  Ideal Solution            Student Solution
List the titles of all   SELECT title              SELECT title
movies that have a       FROM movie WHERE          FROM movie WHERE
critics rating.          NOT(critics='NR');        critics NOT 'NR';



Fig. 1. An SQL problem, its ideal solution, and a student’s incorrect solution. 



SQL-TUTOR models students using Ohlsson’s Constraint-Based Modeling (CBM) 
[3]. CBM proposes the modeling of domains as a set of constraints of the form (Cr, 
Cs). Cr specifies the set of student solutions to which the constraint is relevant, and 
Cs specifies the subset of the relevant student solutions where the constraint is 
satisfied. Each constraint has an associated feedback message that can be displayed if 
the constraint is violated. In figure 1, the student has violated constraint 168 and the 
feedback message is: Make sure NOT is in the right place in the WHERE clause. 

Until recently, problem selection in SQL-TUTOR was based on one simple rule:
the first problem relevant to the single constraint that the student has most frequently
violated in the past was selected. In a real classroom, this proved to be an overly simple
strategy because selected problems were often either too complex or too
simple. Our research has been aimed at improving this situation.

We propose a new problem selection module based on Bayesian belief networks 
(BBNs) [4]. Our approach involves applying the following two steps to each potential 
next problem p. Firstly, the system predicts, for each constraint c, the potential 
teaching effects of p. The main calculation is of the posterior probability
P($Performance_{c,p}$ = VIOLATED), the probability that c will be violated by the student
should he/she attempt problem p. Constraint violations lead to feedback messages,
and constraint-specific feedback is the main way that students learn constraints in
SQL-TUTOR. A BBN for this is depicted in figure 2. $RelevantIS_{c,p}$ is the probability
of c being relevant to p’s IS. The value for this node is always known with certainty.
$RelevantSS_{c,p}$ is the probability of c being relevant to p’s SS. $Mastered_c$ is the
probability of the student having mastered the constraint c. Finally, $Performance_{c,p}$
predicts the behaviour of the student on constraint c. All the nodes except
$Performance_{c,p}$ are binary variables. $Performance_{c,p}$ is a three-valued node taking
values {SATISFIED, VIOLATED, NOT-RELEVANT}.




Fig. 2. A Bayesian network for predicting student performance on a single constraint. 

The second step is to summarise the predictions for p over all the constraints c.
Currently this is done by counting the number of constraints for which
P(Performance_{c,p} = VIOLATED) > 0.45. This number, Feedbacks, is then compared
to the student’s OptimalFeedback. The value of p is -|Feedbacks - OptimalFeedback|.
That is, p has a high value if Feedbacks is close to or the same as
OptimalFeedback. The rationale behind this rule is that if the predicted number of
feedback messages exceeds OptimalFeedback then the student will be overwhelmed
with information and the teaching effects of each message will be discounted. On the
other hand, if the number of feedback messages is less than optimal, then student
learning will be inefficient and the problems may be too easy. Presently
OptimalFeedback starts at 2 and increases linearly with the competence level of the
student.
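
The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the candidate problems and the probability values are hypothetical, and the posteriors are taken as given rather than computed from the BBN.

```python
from typing import Dict

VIOLATION_THRESHOLD = 0.45  # threshold quoted in the text


def predicted_feedbacks(p_violated: Dict[str, float]) -> int:
    """Count constraints whose predicted violation probability exceeds the threshold."""
    return sum(1 for prob in p_violated.values() if prob > VIOLATION_THRESHOLD)


def problem_value(p_violated: Dict[str, float], optimal_feedback: float) -> float:
    """Value of a problem: -|Feedbacks - OptimalFeedback| (0 is the best possible)."""
    return -abs(predicted_feedbacks(p_violated) - optimal_feedback)


def select_problem(candidates: Dict[str, Dict[str, float]], optimal_feedback: float) -> str:
    """Return the name of the candidate problem with the highest value."""
    return max(candidates, key=lambda name: problem_value(candidates[name], optimal_feedback))


# Hypothetical posteriors P(Performance_{c,p} = VIOLATED) for three candidate problems.
candidates = {
    "p1": {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.6, "c5": 0.5},  # 5 predicted feedbacks
    "p2": {"c1": 0.5, "c2": 0.6, "c3": 0.1},                        # 2 predicted feedbacks
    "p3": {"c1": 0.1, "c2": 0.2},                                   # 0 predicted feedbacks
}
best = select_problem(candidates, optimal_feedback=2)
```

With OptimalFeedback set to 2, the problem predicted to generate exactly two feedback messages is preferred over one that would overwhelm the student and one that would be too easy.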

After the student has submitted a solution, the prior probabilities of mastery for
each constraint, P(Mastered_1), P(Mastered_2), ... etc., are updated if the constraint
was relevant to the SS.

We have performed several off-line experiments using student history logs from 
previous user studies of SQL-TUTOR, comparing problems that were selected by the 
original system against problems that would have been selected by the proposed 
system. In the majority of cases, the new system outperforms the old system. 

Future research will investigate the acquisition of probabilities for the BBNs both 
subjectively (by an expert) and from data. We also plan an on-line evaluation of the 
new system in a user study. 

1. Mitrovic A, 1998. Learning SQL with a Computerized Tutor, Proc. 29th SIGCSE Tech. 
Symposium, 307-311. 
ASI Series, Vol. 125. Springer-Verlag, Berlin, 167-189.

Abstract. Our proposed method enables prompt and easy input of Japanese
sentences into a small device. Only 12 keys are used for input: 0, 1, ..., 9,
* and #. Therefore, we are able to input one Kana character per keystroke.
Furthermore, the system based on our method automatically generates a
dictionary adapted to the target field, because the system automatically
acquires words by using inductive learning. The system is improved by its
own learning ability.



1 Outline 

The procedure for our proposed method consists of a translation process, a
proofreading process, a learning process and a feedback process, in this order.

A user inputs the string of numbers corresponding to the pronunciation of
the intended Japanese sentence using only 12 keys [1][2]. In the translation
process, the input sentence is translated into a Kanji-Kana mixed sentence by
using the word dictionary. The words in the word dictionary are applied in
descending order of certainty degree [3][4]. The certainty degree is based on
the situation in which the word was acquired, the rate of correct translation
and the appearance degree of neighboring characters [5]. If the translation
result has errors, the proofreading process is performed: the user judges
whether the result is correct and proofreads it. In the learning process, words
are extracted by comparing the input sentence with its proofread translation
result [3][4]; the two are compared using their common segments. The extracted
words are registered in the word dictionary. In the feedback process, the
certainty degree of each word in the word dictionary is updated. The system
thus repeats these processes and improves.
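
As a rough illustration of the translation step, the sketch below encodes a Kana reading as one digit per character and looks up dictionary candidates in descending order of certainty degree. The key assignment (one key per Kana row, as on common mobile keypads), the dictionary contents and the certainty values are assumptions made for this example; the text does not specify them.

```python
from typing import Dict, List, Tuple

# Assumed key assignment: one key per kana row (not taken from the text).
KANA_ROWS = {
    "1": "あいうえお", "2": "かきくけこ", "3": "さしすせそ",
    "4": "たちつてと", "5": "なにぬねの", "6": "はひふへほ",
    "7": "まみむめも", "8": "やゆよ", "9": "らりるれろ", "0": "わをん",
}
KANA_TO_KEY = {kana: key for key, row in KANA_ROWS.items() for kana in row}


def encode(reading: str) -> str:
    """Map a kana reading to a digit string, one keystroke per kana.

    Raises KeyError for characters outside the assumed table.
    """
    return "".join(KANA_TO_KEY[kana] for kana in reading)


def candidates(digits: str, dictionary: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Return the words registered for a digit string, highest certainty degree first."""
    entries = dictionary.get(digits, [])
    return [word for word, _ in sorted(entries, key=lambda entry: entry[1], reverse=True)]


# Hypothetical word dictionary: digit string -> list of (word, certainty degree).
word_dictionary = {
    encode("かみ"): [("紙", 0.6), ("神", 0.3), ("髪", 0.8)],
}
```

Because every Kana in a row shares one key, distinct readings such as かみ and きむ collapse to the same digit string, which is why the ordering by certainty degree matters.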

2 Evaluation Experiment

A system based on our proposed method was developed for the experiment. The
input data are several sections of the UNIX manual, totalling 122,000
characters. The initial dictionaries are empty so that the adaptability of the
method can be evaluated. The system translates each input sentence
systematically. The results are evaluated by translation rates relative to the
number of input characters: the correct rate, erroneous rate and unfixed rate
are each calculated as a proportion of the number of input characters.
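
The three rates can be computed as below. This is a minimal sketch that assumes every input character falls into exactly one of the three categories; the function name and the sample counts are hypothetical.

```python
from typing import Dict


def translation_rates(correct: int, erroneous: int, unfixed: int) -> Dict[str, float]:
    """Return each category as a proportion of the number of input characters."""
    total = correct + erroneous + unfixed  # assumed to equal the input character count
    if total == 0:
        raise ValueError("no input characters")
    return {
        "correct": correct / total,
        "erroneous": erroneous / total,
        "unfixed": unfixed / total,
    }


# Hypothetical counts for a 1,000-character block of input.
rates = translation_rates(correct=850, erroneous=100, unfixed=50)
```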



3 Results and Considerations

The rate of correct translation increases as the amount of input data
increases, which shows that the system has acquired words and improved. When
the section of input data changes, the rate of correct translation decreases,
because the change of field increases the number of words not yet registered
in the word dictionary. However, the correct rate then increases again as the
system acquires words for the new field. Thus, the system adapts to the new
field immediately. The final rate of correct translation is about 85%.

Abstract. This paper discusses the design and implementation of a
serving agent for integrating soft computing and software agents.



1 Introduction 

When we build agent-based application systems, we need to integrate symbolic
and computational intelligence concepts and technologies in order to design
truly robust, flexible and adaptive intelligent systems. One of the important
computational intelligence technologies is known as soft computing (SC). The
principal members of SC are fuzzy logic (FL), neural networks (NN), genetic
algorithms (GA), etc. SC technologies such as FL, NN, and GA are complementary
rather than competitive. Thus, it is necessary to equip a multi-agent
system with different kinds of SC techniques, the more, the better. The
remaining problem is how to integrate these SC techniques into multi-agent
systems.

In recent years an increasing number of researchers have been working in
the field of hybrid systems in an attempt to find new ways to integrate two or
more technologies to tackle complex real-world problems [1]. Some of this
research, such as the IMAHDA architecture [2] and the PREDICTOR system (see [1],
Ch. 9), involves multi-agent systems. These systems integrate SC technologies
and software agents (SA) by embedding the SC technologies in each individual
SA. Such approaches have limitations: (1) it is impossible to embed many SC
technologies within a single SA, since the SA would be overloaded; (2) it is
not flexible to add more SC technologies to, or delete unwanted ones from, a
software agent.

Our idea for integration is to move the SC abilities from the front-end
agents in a multi-agent system to the back-end as independent SC agents. We
equip these SC modules with communication ability using KQML (Knowledge
Query and Manipulation Language), with the support of JATLite
(http://java.stanford.edu/java_agent). Following Genesereth’s statement that an
entity is a software agent if and only if it communicates correctly in an agent
communication language such as KQML [3], we call these SC modules agents. The
emphasis of our work is on providing a universal approach for incorporating
different SC technologies into multi-agent systems. Such a system has two
notable features: (1) every problem-solving agent can easily access all the SC
technologies available in the system; (2) SC agents can be added and deleted
dynamically as needed.

[Figure 1 diagram labels: front-end, back-end, KQML Message Exchange, Register / Connect]

Fig. 1. Architecture of the Soft Computing Agent Society


2 Modeling and Implementation of SC Serving Agent 

The architecture of such a system is shown in Figure 1. The internal structure
of the SC serving agent consists of a KQML message interpreter (KMI), an
SC-Agent list maintenance module, and an SC-Agent list (database). The KMI is
the interface between the KQML router and the SC serving agent. The SC-Agent
list maintenance module has three functions: add a node containing the SC
agent’s name, ability, and ontology to the list (database); delete a node from
the list; and search the list to find SC agents with a specific ability. In
practice, we combined its implementation with that of the KQML message
interpreter.
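
The three list maintenance functions can be pictured with the following in-memory sketch. It is not the authors' implementation: the class and method names, the ability labels and the agent names are hypothetical, and KQML message handling is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SCAgentEntry:
    """One node of the SC-Agent list: name, ability and ontology."""
    name: str
    ability: str   # hypothetical labels, e.g. "fuzzy-logic", "neural-network"
    ontology: str


class SCAgentList:
    """In-memory stand-in for the SC-Agent list (database)."""

    def __init__(self) -> None:
        self._entries: Dict[str, SCAgentEntry] = {}

    def add(self, entry: SCAgentEntry) -> None:
        """Add a node, replacing any existing node with the same name."""
        self._entries[entry.name] = entry

    def delete(self, name: str) -> bool:
        """Delete a node; return True if it was present."""
        return self._entries.pop(name, None) is not None

    def find_by_ability(self, ability: str) -> List[str]:
        """Return the names of all registered agents with the given ability."""
        return sorted(e.name for e in self._entries.values() if e.ability == ability)


registry = SCAgentList()
registry.add(SCAgentEntry("fl-agent-1", "fuzzy-logic", "control"))
registry.add(SCAgentEntry("nn-agent-1", "neural-network", "forecasting"))
registry.add(SCAgentEntry("nn-agent-2", "neural-network", "classification"))
nn_agents = registry.find_by_ability("neural-network")
removed = registry.delete("nn-agent-1")
```

Keeping the list in the serving agent rather than in each problem-solving agent is what allows SC agents to be added or removed at run time without touching the front-end agents.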

3 Concluding Remarks 

A prototype has been implemented according to this architecture. It exhibits
the two principal features mentioned in Section 1.

Our goal is to provide a platform-independent soft computing support
environment. Using this support environment, multi-agent system developers
need only build the problem-solving agents for a specific application. The
problem-solving agents can then use all the SC technologies in the system when
necessary.

Abstract. Visual geographic knowledge which can be extracted from satellite
remote sensing images has characteristics which are not commonly found in 
non-visual domains. Traditionally geographic expert systems have worked 
either at the pixel level of raster images or the object level of vector images. 

This has shortfalls when knowledge acquisition from a human image 
interpreter has to be incorporated into an expert system to aid interpretation. 

A framework for the classification of visual geographic knowledge will be 
presented that expands beyond the traditional per-pixel model and has been 
used as the theoretical basis of a knowledge acquisition toolkit, KAGES 
(Knowledge Acquisition for Geographic Expert Systems) [2]. This model will 
be compared with the KADS knowledge model to show the relationship with 
modeling in a non-visual environment. 

Keywords: Expert systems, geometric or spatial reasoning, knowledge 
acquisition. 

1 A Proposed Classification of Geographic Knowledge 

Visual (including spatial) knowledge presents special problems for knowledge 
acquisition. Recognising visual features is easy for a human although the cognitive 
processing is complex. Describing those features without the use of diagrams is 
difficult. It is easy for a human expert to show what something looks like, but far 
more difficult to describe it in words, and more difficult again to describe it in terms 
of rules [3]. Given that geographic knowledge is visual and that domain experts work 
with images, a graphical system is required. The following classification is derived 
from and expands on those of McKeown et al [4] and Armstrong [1]. It is more 
rigorous and incorporates non-visual knowledge. It consists of six levels of 
knowledge which are: 

• Primitive Knowledge about the identification of scene primitives, readily
identifiable objects which cannot be subdivided into smaller named entities.

• Relationship Knowledge of the spatial relationships between scene primitives in
terms of their proximity, orientation and degree of overlap.

• Assembly Knowledge, used to define collections of objects which form
identifiable spatial decompositions.

• Non-Visual Knowledge, which helps refine classifications developed using
visual knowledge, including labelling of scene primitives and spatial
relationships. It consists of temporal knowledge of how a scene changes over
time, algorithmic knowledge (including how to combine bands) and heuristic
knowledge of a non-visual nature.

• Consolidation Knowledge used to resolve and evaluate conflicting information.

• Interpretation Knowledge of how to combine the other five types of knowledge
to produce a classified image.

2 Visual Knowledge and KADS 

KADS (Knowledge Acquisition and Development System) consists of a four-layer
model of expertise [5]. In terms of the suggested geographic knowledge
classification, Primitive, Relationship and Assembly Knowledge are forms of
knowledge at the Domain Level under the KADS methodology, as is Heuristic
Knowledge in the Non-Visual category. This knowledge could be used in a variety
of different ways to produce products showing different aspects of an area
covered by an image.

Consolidation Knowledge, on the other hand, requires knowledge of how the rules
are to be applied and is knowledge at the Inference Level. Algorithmic
Knowledge, which contains knowledge of image band combinations and when they
should be applied, is also at the Inference Level.

The Task Level in the KADS system is represented by Interpretation Knowledge and
shows how to apply the problem solving strategy to the whole image set.

There is no equivalent of the Strategy Layer. In future extensions this would be
knowledge of alternate ways of classifying images.
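
The correspondence described in this section can be written down as a small lookup table. The sketch below records only the assignments stated in the text; the identifier names are hypothetical, and temporal knowledge is omitted because the text does not assign it to a layer.

```python
from enum import Enum
from typing import Dict, Set


class KadsLayer(Enum):
    DOMAIN = "Domain"
    INFERENCE = "Inference"
    TASK = "Task"
    STRATEGY = "Strategy"


# Assignments as stated in the section; keys are hypothetical labels.
KNOWLEDGE_TO_LAYER: Dict[str, KadsLayer] = {
    "Primitive": KadsLayer.DOMAIN,
    "Relationship": KadsLayer.DOMAIN,
    "Assembly": KadsLayer.DOMAIN,
    "Non-Visual/Heuristic": KadsLayer.DOMAIN,
    "Non-Visual/Algorithmic": KadsLayer.INFERENCE,
    "Consolidation": KadsLayer.INFERENCE,
    "Interpretation": KadsLayer.TASK,
}


def unused_layers() -> Set[KadsLayer]:
    """Return the KADS layers that have no counterpart in the classification."""
    return set(KadsLayer) - set(KNOWLEDGE_TO_LAYER.values())
```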

1. Introduction

Dynamic real-time scheduling applications are characterized by rapidly changing
requirements and the need for timely response. We employ an approach combining
Constraint Programming and Anytime Algorithms within a Multiagent framework.
Agents implement various heuristic optimization algorithms which work concurrently
on the same scheduling problem. A best solution is always available from some agent
and agents collaborate to improve each other's intermediate results. A multiagent
resource allocation scheme for agents based on economic portfolio management is
described. A report presenting the multiagent architecture and our early experiments
using the prototype system is available [Havens et al. 99].

Our challenge has been to develop a viable planning system for Tactical Air
Mission Planning/Scheduling (TAMP/S) which can effectively schedule helicopter
missions under rapidly changing conditions in real-time. A mission is naturally
viewed as a vehicle routing problem (VRP) [Christofides et al, 76] with time
windows (VRPTW) [Solomon, 87] and capacity constraints. VRPTW problems are
NP-hard in general. Given an incoming stream of aircraft mission requests, the
problem is to assign both helicopters and flight crews to satisfy these
missions. The task is also a multiple criteria optimization problem to minimize
delay, maximize utilization, et cetera.

We require initially feasible solutions but desire much better suboptimal solutions 
as time and computing resources allow. Hence the application is an anytime planning 
problem. We describe here our multiagent approach to this application. Given the strict 
time constraints, we do not use complete exhaustive scheduling methods. Neither do 
we rely on traditional batch-oriented constraint solvers and optimization methods. 
Instead, we exploit two compatible approaches: 1) heuristic constraint optimization; 
and 2) multiagent systems. 

Heuristic constraint optimization methods can often find suboptimal solutions rap- 
idly for even very large real-world scheduling problems (Minton et al, 1990; Glover et 
al, 1993) but such performance cannot be guaranteed. As well, pure heuristic 
(“greedy”) methods often can find very good solutions in sublinear time. 

Given heuristic search methods producing suboptimal solutions but without any
performance guarantees, how can we improve the expected performance of the heuris- 
tic methods in an anytime framework? There exists a class of heuristic constraint opti- 
mization algorithms, called iterative repair techniques (IRT), which can incrementally 
improve a seed solution (perhaps from a greedy algorithm). Given good greedy algo- 
rithms to produce good seed solutions and a stable of good IRT methods, we have the 
opportunity to apply multiple methods (agents) in concert to the same scheduling prob- 
lem. Agents both cooperate and compete. Greedy agents assigned the same scheduling 
problem compete to produce better seed solutions. IRT agents assigned the same seed 
solution compete to produce better and better anytime solutions over time. IRT agents 
cooperate by using seed solutions produced previously by other IRT agents. 

2. Overview of the TAMP/S Architecture 

Our multiagent architecture is illustrated in the figure below which includes:

1. a Representation Manager for creating new mission problems and assigning an
initial greedy planning agent.

2. an Environment Manager for dynamically modifying the constraints under
which the planning agents operate.

3. a Solution Manager for controlling the working set of anytime planning
agents.

Initially, the Solution Manager gives each greedy planner sufficient resources to
compute a heuristic answer and each IRT planning agent a seed and a fixed
allocation of resources. All IRT planners report back their best solution found so
far. The Solution Manager then adjusts its resource allocations to the planning
agents based on a market economy metaphor which measures both the quality (price)
of their present solutions and their rate of improvement (return). If an agent is
performing badly relative to the other agents in the working set (portfolio) then
it is discarded and another agent created in its place.
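
One way to picture the portfolio metaphor is the sketch below, which scores each agent from its reported quality and rate of improvement, discards agents scoring well below the portfolio average, and splits the resource budget among the rest in proportion to score. The scoring formula, the cutoff, the agent names and the numbers are all assumptions made for illustration; the text does not give the actual allocation rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AgentReport:
    name: str
    quality: float      # "price": quality of the best solution so far (higher is better)
    improvement: float  # "return": recent rate of improvement (higher is better)


def score(report: AgentReport, return_weight: float = 1.0) -> float:
    """Hypothetical combined score: price plus weighted return."""
    return report.quality + return_weight * report.improvement


def reallocate(reports: List[AgentReport], budget: float,
               cutoff: float = 0.5) -> Tuple[Dict[str, float], List[str]]:
    """Split the budget in proportion to score; discard agents below cutoff * mean score.

    Assumes non-negative scores with a positive total.
    """
    scores = {r.name: score(r) for r in reports}
    mean = sum(scores.values()) / len(scores)
    kept = {name: s for name, s in scores.items() if s >= cutoff * mean}
    discarded = sorted(set(scores) - set(kept))
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("scores must have a positive total")
    allocation = {name: budget * s / total for name, s in kept.items()}
    return allocation, discarded


# Hypothetical reports from three IRT planning agents.
reports = [
    AgentReport("irt-1", quality=0.8, improvement=0.2),
    AgentReport("irt-2", quality=0.6, improvement=0.4),
    AgentReport("irt-3", quality=0.2, improvement=0.0),
]
allocation, discarded = reallocate(reports, budget=100.0)
```

In this example the third agent is discarded, and a replacement agent would be created in its place before the next round.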

3. Conclusion 

A prototype version of the TAMP/S has been implemented and tested using two greedy 
seed algorithms and a TABU search [Glover93] IRT method. Initial experiments sug- 
gest that the architecture is indeed appropriate for anytime scheduling applications 
where good (but suboptimal) solutions are acceptable and the environment (problem 
specifications and external situation) changes unpredictably over time. 

Abstract. This paper describes a new approach for evolving recurrent
neural networks using Genetic Programming. A system has been developed
to train weightless neural networks using construction rules. The
network construction rules are evolved by the Genetic Programming system,
which builds the solution neural networks. The use of rules allows
networks to be constructed modularly. Experimentation with decomposable
Boolean functions has revealed that the performance of the system
is superior to a non-modular version of the system.



1 Introduction 

Genetic programming (GP) is a modification of the genetic algorithm which 
evolves variable-sized tree structures rather than fixed-length bitstrings. GP has
been applied to neural network (NN) learning ([1,3]) and can automatically 
determine the necessary number of network neurons and the connection weights 
to solve the problem, as well as being able to produce recurrent networks. The 
GP-based Cellular Encoding system of Gruau [2] has shown the ability to exploit 
the decomposability of problems. 

This paper summarizes a novel GP-based system for evolving neural net- 
works (the GPNN system) that implement functions and can take advantage 
of decomposable problems. The system evolves weightless neural networks and 
evolves the activation functions of each neuron. Experimentation has demon- 
strated the ability of the system to find solution networks to test problems and 
shows that decomposable problems are efficiently learned. 

2 System Description and Results 

The GPNN system uses GP to evolve a collection or population of tree-structured
rules. The rules, when read and executed in a left-to-right order, construct a
neural network. The GP system must find a set of rules which when executed
constructs a neural network that solves the specified problem. Rules exist to add
a neuron, add a connection, and to change the activation function of neurons.
Full details of this method can be found in [4].

* Bret Talko is now with the Defence Science and Technology Organisation in
Australia. The authors would like to gratefully acknowledge the support of the Agent

The GPNN system uses weightless networks. The operation of the network is 
determined solely by the activation functions of the neurons and the connections 
between the neurons. The activation functions are represented and evolved as 
arithmetical expression trees. 

Networks are situated on two-dimensional grids, with neurons being located 
at grid vertices. Initially only the input and output neurons are on the grid, 
along with an initial neuron that is used to grow the full network. Each neuron 
is assigned a class number. Multiple neurons may share the same class. Rules 
specify which classes of neurons will apply the rules to themselves. By having 
multiple neurons share the same class, a rule can be simultaneously executed by 
multiple neurons. This ability is particularly useful for forming modular networks 
that have multiple copies of subnetwork modules. 

Experiments using the system for learning 16 decomposable Boolean functions
have been carried out. The functions have four input variables and two output
variables, and are decomposable in the sense that each is formed as an
amalgamation of two identical smaller functions, each having two input
variables and one output variable.
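
The structure of such test functions can be sketched as follows. Two assumptions are made that the text does not confirm: that the 16 functions correspond to the 16 possible two-input Boolean functions, and that the first copy reads inputs 0 and 1 while the second reads inputs 2 and 3.

```python
from itertools import product
from typing import Callable, List, Tuple


def small_function(index: int) -> Callable[[int, int], int]:
    """Return the two-input Boolean function whose truth table is encoded by index (0..15).

    Bit (2*a + b) of index gives the output for inputs (a, b).
    """
    def f(a: int, b: int) -> int:
        return (index >> (2 * a + b)) & 1
    return f


def decomposable(index: int) -> Callable[[int, int, int, int], Tuple[int, int]]:
    """Build a 4-input, 2-output function from two identical copies of a small function."""
    f = small_function(index)

    def g(x0: int, x1: int, x2: int, x3: int) -> Tuple[int, int]:
        return (f(x0, x1), f(x2, x3))
    return g


def truth_table(index: int) -> List[Tuple[int, int]]:
    """Enumerate the outputs of the decomposable function over all 16 input patterns."""
    g = decomposable(index)
    return [g(*bits) for bits in product((0, 1), repeat=4)]


double_and = decomposable(8)  # index 8 encodes AND under this bit layout
double_xor = decomposable(6)  # index 6 encodes XOR under this bit layout
```

A modular learner only has to discover the small function once and replicate it, which is the property the class-sharing rules are meant to exploit.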

For comparison, a non-modular version of the system was used, in which each
rule can affect at most one neuron rather than multiple neurons.

The performance of the modular system was found to be better than or equal
to the performance of the non-modular system.

3 Conclusion 

This paper presented a new approach to evolving neural networks. Experimental 
results indicate it is suited to solving decomposable problems efficiently. 

Generally, having specified the direct effects of an action, approaches for
handling the ramification problem in reasoning about action and change favour
the use of domain constraints to infer any additional indirect effects. This
usually also requires the augmentation of the domain description with a
dynamical specification, both syntactically and semantically, in order to
accurately model the direction of the causal forces that govern the
dependencies between various parts of the system.

In this paper we deviate from the convention that causal rules act to restore 
the integrity of the static constraints describing the system and argue, in a spirit 
similar to Denecker et al. [1], for the de-coupling of the causal rules describing 
the dynamical aspects of a system from a static domain description. Moreover, 
we argue for the dynamical nature of causation and extend propositional logic 
into a simple, abstract language for modelling the causal dependencies of dynam- 
ical systems, furnishing an underlying state transition semantics in the spirit of 
Thielscher [6] and Sandewall [5]. 

For example, consider the simple act of holding a pebble. If we release the
pebble it will fall, coming to rest on the floor. This intuition can be readily
captured if we denote that ‘the pebble is being held’ by h, and that it is ‘elevated’
by e, and supply the constraint: $e \rightarrow h$.

If we modify our scenario by replacing the pebble with an egg we would,
nonetheless, wish to model this system in terms of e and h, based on the
symmetries in the domains and the desire to preserve the modularity of the
description.

The difficulty arises in augmenting the system description to establish a static
constraint that would lead to the egg breaking in terms of the properties e and h.
We argue that neither $e \wedge \neg h \rightarrow b$, reflecting that an egg
released while elevated will break, nor $\neg e \wedge \neg h \rightarrow b$,
with the intuition that on reaching the ground the egg will break, offers an
accurate and faithful description of the system. In particular, the second
alternative, while reflecting our intuition dynamically, does not do so
correctly in a static context, as provided above.

To model such systems we propose a trajectory-based approach in which the 
history of states through which a system evolves, in terms of a chain of indirect 
effects, is used to model the dynamics of a domain. We proceed to argue for the 
many-fold advantages of such an approach. 

Firstly, we need not augment the scenario above (e.g., by adding a fluent ‘f’
to indicate ‘falling’) to specify the dynamical behaviour of the system. Instead,
we claim that the statement:

$(e \leadsto \neg e) \wedge \neg h \rightarrow b$,

with the intended interpretation: ‘If the system undergoes a transition from $e$
to $\neg e$ (i.e., $e \leadsto \neg e$) while the egg was not held ($\neg h$),
then the egg should break ($b$) in the ensuing state’, can concisely and
accurately model the intended dynamical behaviour of the system where,
otherwise, we encounter difficulties in obtaining a faithful static
description. Moreover, by not introducing a ‘falling’ fluent f, as mentioned
above, we avoid the possible introduction of indeterminacy and cyclic fluent
dependencies that can appear when attempting to model transient
properties (such as the event of ‘falling’).
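
To make the trajectory reading concrete, the sketch below hand-codes the egg scenario: a state assigns truth values to the fluents e (elevated), h (held) and b (broken), and the next state is computed from the last two states of the trajectory. This is an informal interpretation for illustration only, not the formal semantics of the paper; the function names and the fixed-point loop are assumptions.

```python
from typing import Dict, List

State = Dict[str, bool]  # fluents: e (elevated), h (held), b (broken)


def step(trajectory: List[State]) -> State:
    """Compute the next state from the trajectory so far."""
    current = trajectory[-1]
    nxt = dict(current)
    # Indirect effect of the constraint "elevated implies held":
    # an elevated egg that is not held falls.
    if current["e"] and not current["h"]:
        nxt["e"] = False
    # Transition rule: a change from e to not-e while not held
    # makes b true in the ensuing state.
    if len(trajectory) >= 2:
        previous = trajectory[-2]
        if previous["e"] and not current["e"] and not current["h"]:
            nxt["b"] = True
    return nxt


def run(initial: State, max_steps: int = 10) -> List[State]:
    """Extend the trajectory until the state stops changing (or max_steps is reached)."""
    trajectory = [dict(initial)]
    for _ in range(max_steps):
        nxt = step(trajectory)
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory


released = run({"e": True, "h": False, "b": False})   # egg released while elevated
resting = run({"e": False, "h": False, "b": False})   # egg already lying on the floor
```

The released egg ends up broken after passing through an intermediate state, while the egg that was never elevated stays intact. A static constraint relating only the current values of e, h and b could not separate these two cases.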

Secondly, we inherit all the modelling features of a state-transition based 
approach, such as the possible dependence of an action on the intermediate 
effects that a system has assumed on course to its current state. This provides us 
with the versatility to model systems that include both sequential (time evolving) 
and simultaneous/concurrent effects within a common framework. 

Thirdly, by augmenting a typical propositional language with the connective
($\leadsto$), having the interpretation indicated above, we can, at an abstract level,
provide an ontological basis for events (e.g., ‘falling’) by associating them with 
multiple states within a trajectory. This allows us to generalise state-transition 
based approaches to modelling dynamical systems by using the trajectory as 
the fundamental descriptive and operational entity through which to obtain the 
dynamical properties of a system. Moreover, it permits us to retain a minimal 
state description (in terms of the number of basic properties — or fluents — of the 
system) and use this to extrapolate the dynamical state of the system. 

Comparisons with respect to McCain & Turner [4] are also made while at- 
tempts to reconcile, translate or reformulate the approach outlined above with 
other formalisms in the literature (e.g., Denecker et al. [1], Gustafsson & Do- 
herty [2], Lin [3], etc.) are left for further investigation. 

Abstract. Work in Bayesian student modelling is described, which
draws on extensive analyses of the way that students think about the
decimal numeration system. The model will form the basis of an adaptive
tutoring system. We present an initial model and discuss issues arising.



This paper describes initial work towards the diagnostic component of a tu- 
toring system, based on Bayesian modelling, that will monitor students engaged 
in various tasks, including standard tests [2], teaching activities, and interactive 
games involving the use of decimal notation [1], and adjust its choice of a next 
activity for the student according to its current classification of the student. 

We briefly describe the extensive background data available that provide key 
parameters for the student model. 

The student model used in this work derives from a very detailed analysis 
of student thinking based on data from over 2500 Victorian school students [3]. 
Understanding decimals is a complex task, which requires bringing together a 
web of related ideas. There are many sources of confusion and many opportu- 
nities for partial knowledge to lead children to make errors systematically. In 
this domain, the ideas that children hold can be grouped into three major cate- 
gories: A, L and S, and several subcategories. The category of apparent experts 
(A) comprises students who generally can decide correctly which of two deci- 
mals is larger. Some of these students are indeed experts (subcategory ATE), 
others follow correct rules with little understanding. Others have misconcep- 
tions, for example believing that only the first two decimal places have meaning 
so that 2.4513 and 2.45 are precisely equal. This can be due to analogy with 
money (subcategory AMO). Students in the longer-is-larger (L) category generally
think that longer decimals are larger numbers than shorter decimals and so
believe 2.14 is greater than 2.8. The various reasons for this, the subcategories 
of L, can be diagnosed by careful examination of student responses to various 
item types. 
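
The way different thinking patterns produce different answers to the same comparison item can be illustrated with the sketch below. The three rules are deliberately simplified readings of the expert, longer-is-larger and money-analogy descriptions above; the function names are hypothetical and the rules are not the diagnostic model itself.

```python
from decimal import Decimal
from typing import Tuple


def _sign(x, y) -> str:
    return ">" if x > y else "<" if x < y else "="


def _split(number: str) -> Tuple[str, str]:
    whole, _, frac = number.partition(".")
    return whole, frac


def expert(a: str, b: str) -> str:
    """Correct comparison of two decimals."""
    return _sign(Decimal(a), Decimal(b))


def longer_is_larger(a: str, b: str) -> str:
    """Simplified L rule: with equal whole parts, the longer decimal part wins."""
    (wa, fa), (wb, fb) = _split(a), _split(b)
    if int(wa) != int(wb):
        return _sign(int(wa), int(wb))
    if len(fa) != len(fb):
        return _sign(len(fa), len(fb))
    return expert(a, b)


def money_analogy(a: str, b: str) -> str:
    """Simplified AMO rule: only the first two decimal places are considered."""
    (wa, fa), (wb, fb) = _split(a), _split(b)
    return _sign(Decimal(f"{wa}.{fa[:2] or '0'}"), Decimal(f"{wb}.{fb[:2] or '0'}"))
```

For the pair 2.14 and 2.8 the simplified L rule picks 2.14 as larger, and for 2.4513 and 2.45 the money-analogy rule reports equality, matching the two misconceptions described in the text. Responses to items of this kind are what the evidence nodes of a student model would observe.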

Figure 1 represents a partial model of the decimal domain. The longitudinal
data and expert assessment provide the necessary (conditional) probabilities for 
the model. This model uses only two evidence nodes, corresponding to two types 
of test items (called Types 4 and 6), four category nodes (A, L, S, U), nodes for 
9 of the (eventual 15) subcategories (LRV, LGN, LUN, SDF, SRN, SUN, ATE, 
AUN, AFC) and 4 dummy nodes included for technical reasons. The priors used 
in our initial tests are based on data from 294 Grade 5 students. 

We conducted 16 tests with different patterns of static evidence using items
of types T4 and T6, which demonstrated that at the category and subcategory
levels, the qualitative behaviour of the network was as intended.

As we scale up, we need to: (i) instrument the games to extract key
subsequences to use as additional evidence; (ii) explore the technology of dynamic
belief networks to deal with learning over time; (iii) experiment with the effect 
of students accidentally not performing true-to-type; (iv) observe whether rare 
thinking patterns (with low priors) will ever be diagnosed and develop meth- 
ods to give more weight to evidence which has particular significance in certain 
circumstances. Our model building shows considerable promise and we are op- 
timistic about the next phase. 

Co-evolution is the term used to identify the process in nature in which two or more
species interact so intimately that their evolutionary fitness depends on each other.
Biological co-evolution has been the inspiration for a class of computational
algorithms called co-evolutionary computing. Co-evolutionary design is an approach
to design problem solving in which the requirements and solutions of design evolve
separately and affect each other. A reconsideration of the purpose of the fitness
function and its effect on convergence is necessary since the fitness function changes
through the co-evolutionary cycles. The interactions between requirements and
solutions of design may add new variables to both aspects of design,
which redefines the search space for requirements and solutions as well as the fitness
function. Our approach is based on the idea of mutualism, one of three types of
co-evolution in nature, in which the interacting populations raise the level of
fitness in both, rather than the two populations competing with each other or one
population living off the other.

Co-evolutionary design is characterised by having a search space of problem 
requirements and a search space of problem solutions. The algorithm has two 
phases, each one corresponding to a simple genetic algorithm. In the first phase, the 
problem space provides the basis for a fitness function used to evaluate alternatives 
in the design space. In the second phase, the solution space provides the basis for a 
fitness function used to evaluate the problem space. Each of the two phases of co- 
evolutionary design is a search process using a simple GA and unchanging fitness 
function. Therefore, each phase corresponds to one design focus and a change in 
phase indicates a change in focus. In co-evolutionary design we need to reconsider 
concepts of evolution, their counterpart in GAs and their meaning in the co- 
evolutionary design process. 

Fitness: Survival of the fittest in evolution has been translated as a fitness 
function in simple GAs. This fitness function is the basis for the comparison of 
alternative solutions. In design, when we let the definition of the fitness function 
change, the value of the fitness function can no longer serve as the basis for 
comparison for all alternative designs. The performance of individuals in the 
solution space can only be compared when they are evaluated using the same fitness 
function. This makes it difficult to compare the performance of solutions across
different phases of the co-evolutionary design process. The performance of
individuals is used to determine which members of the population “survive”, or are 
selected to participate in the next generation of search. When searching a space, 
either the problem space P or the solutions space S, the performance is measured by 
how well the alternatives satisfy the focus. A focus for the search for a design 
solution is a function of the set of possible design requirements, and a focus for the 
design requirements is based on the current set of design solution alternatives. 

Convergence: Convergence in evolutionary algorithms means that the search 
process has led to the “best” design in terms of the specified fitness function. 
Convergence is typically the criterion for termination of the evolutionary search
process. Since the fitness function in co-evolutionary design changes from one phase 
to another, the idea of convergence needs to be reconsidered. This requires a 
consideration of the purpose of co-evolutionary design as compared to evolutionary 
search. The purpose of evolutionary search is to find the best solution for a given 
environment, where the environment is effectively represented by the fitness 
function. The purpose of co-evolutionary design is to explore both the problem and 
solution spaces, allowing both to change in reaction to each other, until a satisfactory 
combination of a problem statement and solution state is found. The exploratory 
nature of the co-evolutionary process implies that the process should search until the 
potential for new ideas is reduced. We propose then, that convergence is not related 
to fitness, but to the similarity of the members of the population. A population in 
which there is little change in the genotypes of the members when compared to the 
previous population indicates that the search process has converged. 
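
The similarity-based convergence test proposed above can be sketched as follows. This is only a minimal illustration under assumptions that the text does not specify: genotypes are fixed-length sequences, change is measured as the normalised Hamming distance from each member to its closest counterpart in the previous population, and `max_change` is a hypothetical threshold.

```python
from typing import List, Sequence

Genotype = Sequence[int]


def hamming(a: Genotype, b: Genotype) -> int:
    """Number of gene positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def population_change(previous: List[Genotype], current: List[Genotype]) -> float:
    """Average, over the current population, of the normalised Hamming
    distance to the closest genotype in the previous population."""
    total = 0.0
    for genotype in current:
        nearest = min(hamming(genotype, old) for old in previous)
        total += nearest / len(genotype)
    return total / len(current)


def has_converged(previous: List[Genotype], current: List[Genotype],
                  max_change: float = 0.05) -> bool:
    """Assumed criterion: the phase has converged when the population has
    changed by no more than max_change since the previous generation."""
    return population_change(previous, current) <= max_change
```

Note that the test never consults a fitness value, which is the point of the proposal: it stays meaningful even though the fitness function differs from phase to phase.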

Termination: The link between convergence and termination in evolutionary 
algorithms occurs because the convergence to the “best” solution indicates that the 
search should be terminated. In co-evolutionary design, convergence is determined 
for each phase of the search, that is, for a given focus, and following the convergence 
for one focus, another focus is determined and search commences in the other space. 
This indicates a separation of termination and convergence. We use termination to 
indicate when the co-evolutionary process should stop, and convergence to indicate 
when the search in a given space for a given focus should stop. One criterion for 
termination is the number of cycles of the co-evolution process. This criterion is 
equivalent to setting a time limit for the design process. Often, the time limit is a 
major criterion for signalling when exploration of changes in problem and solution 
should stop. Another criterion for termination is similar to the convergence criterion 
above: no new fitness functions are being found. The significance of this
criterion is that the algorithm is not able to identify a different focus for the design 
and therefore, new ideas have been exhausted. 
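
The separation of convergence (per phase) from termination (whole process) can be sketched as a loop. This is a hedged illustration, not the authors' implementation: `search` stands in for one simple-GA phase run to convergence, `derive_focus` stands in for building a fitness function from the other space, and both names, like the string results, are assumptions made for the example.

```python
from typing import Callable, Hashable, List, Tuple


def co_evolve(problems: List, solutions: List,
              search: Callable[[List, Hashable], List],
              derive_focus: Callable[[List], Hashable],
              max_cycles: int = 10) -> Tuple[List, List, int, str]:
    """Alternate between searching the solution space and the problem space.

    Terminates when the cycle limit is reached or when a phase cannot
    produce a focus that has not been used before.
    """
    seen = set()
    completed = 0
    while completed < max_cycles:
        # Phase 1: the problem space supplies the focus for the solution search.
        focus_s = ("solution-search", derive_focus(problems))
        if focus_s in seen:
            return problems, solutions, completed, "no new focus"
        seen.add(focus_s)
        solutions = search(solutions, focus_s[1])

        # Phase 2: the solution space supplies the focus for the problem search.
        focus_p = ("problem-search", derive_focus(solutions))
        if focus_p in seen:
            return problems, solutions, completed, "no new focus"
        seen.add(focus_p)
        problems = search(problems, focus_p[1])
        completed += 1
    return problems, solutions, completed, "cycle limit"
```

With toy stand-ins such as `derive_focus=max` and a `search` that moves every member to the focus, the loop stops after one cycle because the second cycle would reuse a focus it has already explored.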
1 Introduction

This paper proposes a method for generating artistic calligraphic fonts. In
Japanese calligraphy, characters are written with a writing brush and black
ink; the art is popular in Japan, China, and elsewhere. Brush characters are
handwritten, so the artist’s brush pressure, brush speed, and ink quantity are
very important factors in the artistic effect. A calligraphic character consists
of several brush strokes. In Japanese calligraphy, the artistic effect also
appears in the composition, that is, the position of each brush stroke and the
stroke shape itself, but here we limit the discussion to the scratched and
blurred looks of calligraphic characters.

The scratched and blurred looks are both essential artistic effects of brush
strokes. The scratched look appears where black ink fails to reach the paper
from the writing brush; it is a white part within a brush stroke. Similarly, the
blurred look appears where black ink runs and spreads into the paper from the
writing brush; it is a part blurred black with ink.

2 System Overview 




Fig. 1. The process of generating artistic fonts with the system.

This section describes the system that implements the proposed method for
generating artistic fonts. Figure 1 gives an overview of the system. The
proposed method is divided into the following four stages.

I. An original calligraphic font is input and converted into a bitmap
image format. We call it the original bitmap font.

II. The skeleton of the original bitmap font is detected. We employ the
line-thinning algorithm of [Hil69][NESI98] for this task. The skeleton is
itself a bitmap image, represented by a set of black pixels.

III. Brush-touch cursors are placed on each black pixel of the skeleton, and the
artistic font with a scratched or blurred look is completed.

IV. The artistic font is displayed and output.

The brush-touch cursors are represented as black-and-white bitmap patterns.
The black part of a brush-touch cursor expresses the ink-colored part of the
calligraphic font, while the white part expresses the colorless part, that is,
the scratched look.
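
Stage III can be illustrated with a small sketch. It assumes bitmaps stored as lists of rows with 1 for black and 0 for white, a cursor centred on each skeleton pixel, and overlapping stamps combined by a logical OR; the compositing rule and the function name `stamp_cursors` are assumptions, since the text does not specify them.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = black (ink), 0 = white (no ink)


def stamp_cursors(skeleton: Bitmap, cursor: Bitmap) -> Bitmap:
    """Place the brush-touch cursor on every black pixel of the skeleton.

    Ink from overlapping stamps is combined with a logical OR (an assumed
    compositing rule); parts of the cursor falling outside the image are cut.
    """
    height, width = len(skeleton), len(skeleton[0])
    c_height, c_width = len(cursor), len(cursor[0])
    off_y, off_x = c_height // 2, c_width // 2  # cursor centre offset
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not skeleton[y][x]:
                continue
            for dy in range(c_height):
                for dx in range(c_width):
                    ty, tx = y + dy - off_y, x + dx - off_x
                    if 0 <= ty < height and 0 <= tx < width and cursor[dy][dx]:
                        out[ty][tx] = 1
    return out
```

For example, a 3x3 cursor whose middle row is white leaves a white streak along a horizontal skeleton line, which is one way a scratched look could arise under these assumptions.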

3 Experimental Results and Future Work 

Figure 2 shows a series of examples with a variety of scratched and blurred
looks; the leftmost character is the input font. Until now, whenever artistic
fonts were needed, data for each artistic font had to be prepared in advance,
which leads to a very large font database. The system solves this problem and,
moreover, generates various types of artistic fonts. The system requires a lot
of processing, and the amount depends on the size of the input font; we are
currently working on this problem.




Fig. 2. Output examples; the leftmost character is the input font.

Abstract. The CDL network [1] produces too many prototypes. In this
paper, we reduce the number of prototypes by using a competitive learning
algorithm and adding a re-labeling procedure to obtain a correct
labeling of clusters.



1 Our Model 

In the training phase, if the similarity $s(z_j, x_i)$ [1] exceeds the threshold
for only one prototype $z_j$ and an input pattern $x_i$, then $z_j$ is updated
with the following competitive learning formula:

$$z_j = z_j + \alpha (x_i - z_j) \qquad (1)$$

where $\alpha$ is a learning parameter with a value between 0 and 1. Figure 1(a)
shows the prototypes obtained by the CDL network and by our network.
Black rectangles represent prototypes obtained by the CDL network, and ellipses
represent prototypes obtained by our network.
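
The update rule can be sketched in a few lines. This is an illustration only: the similarity measure of [1] is not reproduced here, so it is passed in as a function, and the names `train_step`, `similarity`, and `threshold` are assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def update_prototype(z: Vector, x: Vector, alpha: float) -> List[float]:
    """Competitive learning update: move z a fraction alpha towards x."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return [zi + alpha * (xi - zi) for zi, xi in zip(z, x)]


def train_step(prototypes: List[List[float]], x: Vector,
               similarity: Callable[[Vector, Vector], float],
               threshold: float, alpha: float) -> Optional[int]:
    """Update a prototype only when exactly one prototype is similar enough
    to the input pattern; return its index, or None if no update is made."""
    matches = [j for j, z in enumerate(prototypes)
               if similarity(z, x) > threshold]
    if len(matches) != 1:
        return None
    j = matches[0]
    prototypes[j] = update_prototype(prototypes[j], x, alpha)
    return j
```

With alpha close to 1 the winning prototype jumps almost onto the pattern, and with alpha close to 0 it barely moves.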
(a) Prototypes. (b) Re-labeling.

Fig. 1. Description of prototypes generated and re-labeling process.

We add re-labeling to our model so that two clusters that are similar enough
receive the same label. To illustrate how re-labeling works, Figure 1(b) shows an
example of 19 patterns that belong to the same cluster:

Step 1. Four patterns are clustered together and labeled as 0. These patterns
are placed in $X_{cl}$.

Step 2. Thirteen patterns are clustered and labeled as 1, and then placed
in $X_{cl}$.

Step 3. Now we check whether there exists a pair of re-labeling patterns
between cluster 1 and cluster 0. A pair of re-labeling patterns is two patterns
that are sufficiently similar to each other. In this case, re-labeling patterns,
marked as upright triangles, are found between cluster 1 and cluster 0, so we
re-label cluster 0 as 1.

Step 4. Two patterns are clustered and labeled as 2, and are placed in $X_{cl}$.

Step 5. Now we check whether there exists a pair of re-labeling patterns between
cluster 2 and any cluster in $X_{cl}$. Re-labeling patterns, marked as inverted
triangles, are found between cluster 2 and cluster 1. Therefore, we re-label
cluster 1 as 2.
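
The five steps above can be sketched as follows. The sketch assumes that "sufficiently similar" means a Euclidean distance of at most a hypothetical `max_dist`, and that clusters are processed in the order in which they were formed; neither detail is fixed by the text.

```python
import math
from typing import List, Sequence, Tuple

Pattern = Tuple[float, ...]


def has_relabeling_pair(a: Sequence[Pattern], b: Sequence[Pattern],
                        max_dist: float) -> bool:
    """True if some pattern of a and some pattern of b are close enough."""
    return any(math.dist(p, q) <= max_dist for p in a for q in b)


def relabel(clusters: List[Sequence[Pattern]], max_dist: float) -> List[int]:
    """Return one label per cluster after re-labeling.

    Each new cluster initially takes its own index as label; every earlier
    cluster linked to it by a re-labeling pair takes over the new label,
    together with all clusters that already share that earlier label.
    """
    labels: List[int] = []
    for new_idx, new_cluster in enumerate(clusters):
        for old_idx in range(new_idx):
            if has_relabeling_pair(clusters[old_idx], new_cluster, max_dist):
                old_label = labels[old_idx]
                labels = [new_idx if lab == old_label else lab for lab in labels]
        labels.append(new_idx)
    return labels
```

In the example of the steps, the three clusters formed in sequence would all end with label 2, so the 19 patterns are reported as a single cluster.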

2 Experiment 

To demonstrate the effectiveness of our method, we show the result for a data set
of 400 patterns shown in Figure 2, where Co = 49, q = 0.9, and 7 = 0.9. Figure 2(a)
shows the data set and Figure 2(b) shows the results.



[Figure 2: (a) Data set. (b) Results, a table comparing our method with the CDL network by number of clusters, number of prototypes, and entries for clusters 1 to 4; most cell values are illegible in the source.]

Fig. 2. Description of experiment 1.
The development of standardised product and process models for the building and
construction industry has now reached a stage where collaborative design is feasible. 
The challenge comes from the appropriate adoption of emerging technologies to 
support advanced data interoperability at different levels of granularity. 
Interoperability is the enabling mechanism that allows information to be exchanged 
between collaborative systems. We focus on advanced coordination between the
design system (e.g. a CAD system) and a building code checking system based on the
Building Code of Australia (BCA). This coordination enables design tasks (e.g.
drawings) produced within a CAD system to be processed automatically by an
external system, e.g. a building code checking system. One technical difficulty is
the recognition of CAD and BCA objects. The process covers the information flow from a CAD system to
the code checking system. It contains the events and activities taking place within 
each separate CAD and compliance checking system, and through the communication 
channels between the two systems. The code checking system needs the recognition 
of CAD objects such as doors, walls and passageways and their relationships. This 
information is already available in the new generation of CAD systems in an implicit 
form. BCA objects may further be substantiated (e.g. through mapping) within the 
compliance checking system based on the incoming CAD objects according to 
building code requirements. Recognition of these objects requires inference 
techniques. There are two levels of inference occurring within the process. The first 
level is across the communication link between the CAD application and the 
compliance checking application. This derives the BCA objects from the CAD objects 
and maintains the mapping between them. The second level of inference is in the 
application of the BCA rules. There are several software modules in integrated design 
systems, e.g. Design Task Representation Module (DTRM), System State Module 
(SSM), Knowledge Representation Module (KRM) and Design Task Interpretation 
Module (DTIM). 

In DTRM, design tasks are described in terms of data objects. A data object has an
arbitrary number of properties. The features of these properties are their simplicity in 
structure, generality for different tasks and flexibility for linking with global 
representation standards, such as the international STEP standard, and existing research 
activities in the domain of artificial intelligence for design, e.g. the FBS model [1]. For 
this application, it satisfies various requirements from a building design perspective. 
The code checking system has been implemented to work cooperatively with external 
systems such as CAD systems to provide integrated solutions. The design task
interpretation module is thus designed to be capable of converting external design task 
information into a processable form within the code checking system. In SSM, the 
system state information is the system’s internal understanding of its problem-solving 
status, e.g. what has been achieved and what is to be done at a particular point in time,
and how information can be derived from design tasks and users (through 
human/computer interaction). The SSM offers a set of operation routines to provide 
communication services with external systems for information exchange, e.g. CAD 
systems on building design-related applications. There are two types of BCA rules in 
the knowledge base: auxiliary rules and core rules managed by KRM. Auxiliary rules
are used for verifying the presence of all the conditional elements of a BCA clause, and 
are optional. Core rules deal with the complying information of that clause, i.e. the core 
semantics of the clause, and are compulsory. The DTIM responds to any external events 
initiated from CAD systems by constructing a unique task model in the memory driven 
by the knowledge base. DTIM is based on an event-driven inference strategy. Upon
receiving a design task (an event), the task information is decomposed into subtasks 
until regional design areas are identified. Regional designs are the minimum design 
areas where the BCA can become effective, and are classified according to BCA rules
from the knowledge base. The major advantage of dynamic design task modelling is its 
flexibility and cost effectiveness in delivering solutions, e.g. overcoming the bottleneck 
of total building modelling. 
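
The distinction between auxiliary and core rules can be illustrated with a small sketch. Everything specific in it is hypothetical: the class names, the reading that a clause is "not applicable" when an auxiliary rule fails, and in particular the door-width clause and its 800 mm value, which are invented for the example and are not taken from the BCA.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DataObject:
    """A design-task data object with an arbitrary number of properties."""
    kind: str
    properties: Dict[str, Any] = field(default_factory=dict)


Rule = Callable[[DataObject], bool]


@dataclass
class Clause:
    name: str
    core: Rule                                           # compulsory
    auxiliary: List[Rule] = field(default_factory=list)  # optional


def check(clause: Clause, obj: DataObject) -> str:
    """Auxiliary rules verify that the clause's conditional elements are
    present; only then is the core rule evaluated for compliance."""
    if not all(rule(obj) for rule in clause.auxiliary):
        return "not applicable"
    return "complies" if clause.core(obj) else "does not comply"


# Hypothetical clause with an invented threshold, for illustration only.
DOOR_WIDTH = Clause(
    name="hypothetical-door-width",
    auxiliary=[lambda o: o.kind == "door",
               lambda o: "width_mm" in o.properties],
    core=lambda o: o.properties["width_mm"] >= 800,
)
```

Calling `check(DOOR_WIDTH, DataObject("door", {"width_mm": 900}))` would report compliance, while a wall object would be reported as not applicable without the core rule being evaluated.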

Concerning application impact, BCAider [2] is a commercial product developed at 
CSIRO for building compliance checking. It is a world-leading product but suffers from 
the fundamental drawback that a human must answer questions. The power of the new 
compliance checking system comes from its design knowledge base. One important
feature of this knowledge base is that it is capable of describing regulation constraints at 
the design level, thus making it feasible for the building code checking system to be 
integrated with a design system such as CAD systems. The system components were 
written in Visual C++. The computing platforms are PC 486 onwards with Windows 95 
or Windows NT. 

Acknowledgments 

The authors wish to thank CSIRO colleagues who have contributed to this work at 
various stages, particularly Kevin Gu, Mike Rahilly, Michael Ambrose and John 
Mashford. 

References 

1. Gero, J.: Design Prototypes: A Knowledge Representation Schema for Design. AI Mag.
11(4) (1990) 26-36.

2. Sharpe, R., Oakes, S.: Advanced IT Processing of Australian Standards and Regulations.
Int. J. Constr. Inform. Tech. 3(1) (1995) 73-89.




Information-Based Cooperation in Multiple 
Agent Systems 
Extended Summary 



Yuefeng Li and Chengqi Zhang 

School of Computing and Mathematics 
Deakin University, Geelong VIC 3217, Australia 
{yuefeng, chengqi}@deakin.edu.au



Abstract. In this paper, we divide the cooperation in multiple agent
systems into task-based cooperation and information-based cooperation.
We describe information-based cooperation as information synthesizing
and decision making. To implement information synthesizing, agents’
beliefs are decomposed into two levels. The higher level represents the
possible information, and the lower level subsequently estimates a numerical
function for the belief by synthesizing the possible information.



1 Information-Based Cooperation 

There are two distinct research fields in distributed artificial intelligence (DAI):
distributed problem solving (DPS) and multiple agent systems (MAS). In DAI,
the central goal is problem solving, and there are several scenarios, such as
task decomposition, task unique-allocation and task multi-allocation. Based on
these scenarios, we can divide cooperation into task-based cooperation and
information-based cooperation.

In task-based cooperation (often within the area of DPS), the sender agent
knows which agents have the problem solving abilities (PSA) to do its
subtasks. As soon as it receives the answers about its tasks, the agent treats
these answers as if it had obtained them itself. The situation is quite different in
information-based cooperation (often within the area of MAS). The sender agent
cannot be sure which agents have the PSA to do its tasks, because it always assumes
that other agents will help one another only when it is in their own best interests to
do so, and in most cases the solutions sent by different agents for the same task
differ and may even contain uncertain information.

For task-based cooperation, one common approach is to use a central
coordinator (agent) that performs centralized planning. In task-based cooperation,
agents lose their autonomy, because the society of agents is just like an industrial
workshop or a management organization. Autonomy is an important characteristic
in MAS. There are several notable efforts to describe autonomy, e.g.,
artificial social systems, social laws, establishing cooperation through the
formalization of agents’ intentions, reaching consensus by negotiation or by an
economic decision process, and building ideal rational agents. For this kind of
cooperation, information-based cooperation is important, because the information
an agent receives or sends is based not only on its beliefs but also on its
interests. At this stage, agents cannot expect to obtain certain information from
other agents, and they have to synthesize (including conflict resolution) some
uncertain information before making decisions.



2 Information Synthesizing in MAS 



The information mainly comes from two streams: one from agents’ local sensors,
and another from other agents. In order to capture this information properly,
we represent an agent’s belief at two levels. The higher level represents the
possible information about the possible worlds, and the lower level subsequently
estimates a numerical function for the belief.

A pair $(PW_A(l), \Gamma_A)$ is used to describe the higher level when agent $A$’s
local state is $l$, where $PW_A$ is a mapping $PW_A : L_A \to 2^W$, which describes
the knowledge of how the agent explains its local states. If the local state is
$l \in L_A$ (the set of agent $A$’s local states), the agent believes that
$PW_A(l) \subseteq W$ (the set of possible worlds) is the plausible world (i.e., the
true world state is in $PW_A(l)$). $\Gamma_A$ is another mapping
$\Gamma_A : \Xi \to 2^W - \{\emptyset\}$, which represents the information (opinion)
provided to agent $A$ by other agents ($\Xi$ is the set of other agents).

In the lower level, the information provided by other agents can first be
captured reasonably by a Dempster-Shafer mass function
$m_A : 2^W \to [0,1]$, such that

$$m_A(S) = \begin{cases} Pr_A(\{\xi \in \Xi \mid \Gamma_A(\xi) = S\}) & \text{if } S \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

for all $S \subseteq W$, where we use a probability distribution $Pr_A$ on $\Xi$ to
describe the degrees to which agent $A$ believes in other agents’ PSAs.
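
The construction of the mass function can be sketched as follows: the mass of a set of worlds is the total probability of the agents whose opinion is exactly that set. The dictionary representation and the name `mass_function` are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Hashable

Worlds = FrozenSet[Hashable]


def mass_function(pr: Dict[Hashable, float],
                  opinions: Dict[Hashable, Worlds]) -> Dict[Worlds, float]:
    """Build a Dempster-Shafer mass function from other agents' opinions.

    pr maps each other agent to the degree to which it is trusted (the
    values are assumed to sum to 1); opinions maps each agent to the
    non-empty set of worlds it considers possible.
    """
    mass: Dict[Worlds, float] = defaultdict(float)
    for agent, probability in pr.items():
        opinion = opinions[agent]
        if not opinion:
            raise ValueError("an opinion must be a non-empty set of worlds")
        mass[opinion] += probability
    return dict(mass)
```

Because every opinion is non-empty, the empty set never receives mass, which matches the "0 otherwise" branch of the definition.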

Based on the local state $l$, the agent then generates its belief by combining
$m_A$ and its observation $PW_A(l)$, $l \in L_A$. In this paper, we use Dempster’s
rule of conditioning to synthesize the two kinds of information:

$b_A^l : 2^W \to [0,1]$, such that

$$b_A^l(S) = \begin{cases} \dfrac{\sum_{T \cap PW_A(l) = S} m_A(T)}{1 - \sum_{T \cap PW_A(l) = \emptyset} m_A(T)} & \text{if } S \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

The function $b_A^l$ is still a Dempster-Shafer mass function.
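
Dempster's rule of conditioning, as used here, can be sketched in a few lines: each focal set is intersected with the observed set of plausible worlds, mass falling on the empty set is discarded, and the remainder is renormalised. The representation of mass functions as dictionaries keyed by frozensets and the name `condition` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Hashable

Worlds = FrozenSet[Hashable]


def condition(mass: Dict[Worlds, float], observed: Worlds) -> Dict[Worlds, float]:
    """Condition a mass function on the agent's own observation.

    Opinions that completely conflict with the observation (empty
    intersection) are dropped and the remaining mass is renormalised.
    """
    conflict = sum(v for t, v in mass.items() if not (t & observed))
    if conflict >= 1.0:
        raise ValueError("every opinion conflicts with the observation")
    conditioned: Dict[Worlds, float] = defaultdict(float)
    for focal, value in mass.items():
        restricted = focal & observed
        if restricted:
            conditioned[restricted] += value / (1.0 - conflict)
    return dict(conditioned)
```

The result again sums to 1 over non-empty sets, which is why the conditioned function remains a mass function.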

We can prove that
$\sum_{T \cap PW_A(l) = S} m_A(T) = Pr_A(\{\xi \in \Xi \mid \Gamma_A(\xi) \cap PW_A(l) = S\})$
and
$1 - \sum_{T \cap PW_A(l) = \emptyset} m_A(T) = 1 - Pr_A(\{\xi \in \Xi \mid \Gamma_A(\xi) \cap PW_A(l) = \emptyset\})$,
where the set $\{\xi \in \Xi \mid \Gamma_A(\xi) \cap PW_A(l) = S\}$ (with $S \neq \emptyset$)
contains agents whose opinions do not completely conflict with agent $A$’s
observation, and the set $\{\xi \in \Xi \mid \Gamma_A(\xi) \cap PW_A(l) = \emptyset\}$
contains all agents whose opinions completely conflict with agent $A$’s
observation. So the function $b_A^l$ synthesizes all opinions of those agents whose
opinions do not completely conflict with agent $A$’s observation $PW_A(l)$.

3 Conclusion 

We have formalized a method for information-based cooperation in multiple 
agent systems. The new method can synthesize (including conflict resolution) 
the information which comes from agents’ local sensors and other agents by 
using Dempster-Shafer theory. 